Update TODO accordingly.
+++ /dev/null
-From fjoe@iclub.nsu.ru Tue Jan 23 03:38:45 2001
-Date: Tue, 23 Jan 2001 14:31:24 +0600 (NS)
-From: Max Khon <fjoe@iclub.nsu.ru>
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Bug in FOREIGN KEY
-In-Reply-To: <200101230416.XAA04293@candle.pha.pa.us>
-Message-ID: <Pine.BSF.4.21.0101231429310.12474-100000@iclub.nsu.ru>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Status: RO
-hi, there!
-On Mon, 22 Jan 2001, Bruce Momjian wrote:
-> > This problem with foreign keys has been reported to me, and I have confirmed
-> > the bug exists in current sources. The DELETE should succeed:
-> >
-> > ---------------------------------------------------------------------------
-> >
-> > CREATE TABLE primarytest2 (
-> > col1 INTEGER,
-> > col2 INTEGER,
-> > PRIMARY KEY(col1, col2)
-> > );
-> >
-> > CREATE TABLE foreigntest2 (col3 INTEGER,
-> > col4 INTEGER,
-> > FOREIGN KEY (col3, col4) REFERENCES primarytest2
-> > );
-> > test=> BEGIN;
-> > BEGIN
-> > test=> INSERT INTO primarytest2 VALUES (5,5);
-> > INSERT 27618 1
-> > test=> DELETE FROM primarytest2 WHERE col1 = 5 AND col2 = 5;
-> > ERROR: triggered data change violation on relation "primarytest2"
-I have another (slightly different) example:
---- cut here ---
-test=> CREATE TABLE pr(obj_id int PRIMARY KEY);
-NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index 'pr_pkey' for
-table 'pr'
-NOTICE: CREATE TABLE will create implicit trigger(s) for FOREIGN KEY
-test=> BEGIN;
-test=> INSERT INTO pr (obj_id) VALUES (1);
-INSERT 200539 1
-test=> INSERT INTO fr (obj_id) SELECT obj_id FROM pr;
-INSERT 200540 1
-test=> DELETE FROM fr;
-ERROR: triggered data change violation on relation "fr"
---- cut here ---
-we are running postgresql 7.1 beta3
-From sszabo@megazone23.bigpanda.com Tue Jan 23 13:41:55 2001
-Date: Tue, 23 Jan 2001 10:41:21 -0800 (PST)
-From: Stephan Szabo <sszabo@megazone23.bigpanda.com>
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Jan Wieck <janwieck@Yahoo.com>, Peter Eisentraut <peter_e@gmx.net>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Bug in FOREIGN KEY
-In-Reply-To: <200101230417.XAA04332@candle.pha.pa.us>
-Message-ID: <Pine.BSF.4.21.0101231031290.40955-100000@megazone23.bigpanda.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Status: RO
-> > I think I misinterpreted the SQL3 specs with respect to this detail. The
-> > checks must be made per statement, not at the transaction
-> > level. I'll try to fix it, but we need to define what will
-> > happen with referential actions in the case of conflicting
-> > actions on the same key - there are some possible conflicts:
-> >
-> >
-> > Do the referencing rows reference to the new PK row with
-> > the same key now, or is this still a constraint
-> > violation? I would say it's not, because the constraint
-> > condition is satisfied at the end of the transaction. How
-> > do other databases behave?
-> >
-> >
-> > Again I'd say that the action should be suppressed
-> > because a matching PK row is present at transaction end -
-> > it's not the same old row, but the constraint itself is
-> > still satisfied.
-I'm not actually sure on the cascade, set null and set default. The
-way they are written seems to imply to me that it's based on the state
-of the database before/after the command in question as opposed to the
-deferred state of the database because of the stuff about updating the
-state of partially matching rows immediately after the delete/update of
-the row which wouldn't really make sense when deferred. Does anyone know
-what other systems do with a case something like this, all in a
-transaction?
-create table a (a int primary key);
-create table b (b int references a match full on update cascade
- on delete cascade deferrable initially deferred);
-insert into a values (1);
-insert into a values (2);
-insert into b values (1);
-delete from a where a=1;
-select * from b;
-From pgsql-hackers-owner+M3901@postgresql.org Fri Jan 26 17:00:24 2001
-From: Jan Wieck <janwieck@Yahoo.com>
-Message-Id: <200101262202.RAA04701@jupiter.greatbridge.com>
-Subject: Re: [HACKERS] Bug in FOREIGN KEY
-In-Reply-To: <200101262110.QAA06902@candle.pha.pa.us> from Bruce Momjian at "Jan
- 26, 2001 04:10:22 pm"
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-Date: Fri, 26 Jan 2001 17:02:32 -0500 (EST)
-CC: Jan Wieck <janwieck@Yahoo.com>, Peter Eisentraut <peter_e@gmx.net>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-X-Mailer: ELM [version 2.4ME+ PL68 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: RO
-Bruce Momjian wrote:
-> Here is another bug:
-> test=> begin;
-> test=> INSERT INTO primarytest2 VALUES (5,5);
-> INSERT 18757 1
-> test=> UPDATE primarytest2 SET col2=1 WHERE col1 = 5 AND col2 = 5;
-> ERROR: deferredTriggerGetPreviousEvent: event for tuple (0,10) not
-> found
- Schema?
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-#================================================== JanWieck@Yahoo.com #
-From pgsql-hackers-owner+M3864@postgresql.org Fri Jan 26 10:07:36 2001
-Date: Fri, 26 Jan 2001 16:07:27 +0100 (CET)
-From: Peter Eisentraut <peter_e@gmx.net>
-To: Hiroshi Inoue <Inoue@tpf.co.jp>
-cc: Bruce Momjian <pgman@candle.pha.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Open 7.1 items
-In-Reply-To: <3A70FA87.933B3D51@tpf.co.jp>
-Message-ID: <Pine.LNX.4.30.0101261604030.769-100000@peter.localdomain>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Sender: 520083510237-0001@t-dialin.net
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: RO
-Hiroshi Inoue writes:
-> What does this item mean ?
-> Is it the following ?
-> begin;
-> insert into pk (id) values (1);
-> update(delete from) pk where id=1;
-> ERROR: triggered data change violation on relation "pk"
-> If so, isn't it a simple bug ?
-Depends on the definition of "bug". It's not spec compliant and it's not
-documented and it's annoying. But it's been like this for a year and the
-issue is well known and can normally be avoided. It looks like a
-documentation to-do to me.
-Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
-From pgsql-hackers-owner+M3876@postgresql.org Fri Jan 26 13:07:10 2001
-Message-ID: <8F4C99C66D04D4118F580090272A7A234D32C1@sectorbase1.sectorbase.com>
-From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
-To: "'Jan Wieck'" <janwieck@Yahoo.com>,
- <pgsql-hackers@postgresql.org>,
- Bruce Momjian <root@candle.pha.pa.us>
-Subject: RE: [HACKERS] Open 7.1 items
-Date: Fri, 26 Jan 2001 10:02:59 -0800
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2653.19)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: RO
-> > FOREIGN KEY INSERT & UPDATE/DELETE in transaction "change violation"
-> A well known issue, and I've asked multiple times how exactly
-> we want to define the behaviour for deferred constraints. Do
-> foreign keys reference just to a key value and are happy with
-> its existence, or do they refer to a particular row?
-I think first. The last is closer to OODBMS world, not to [O]RDBMS one.
-> Consider you have a deferred "ON DELETE CASCADE" constraint
-> and do a DELETE, INSERT of a PK. Do the FK rows need to be
-> deleted or not?
-Good example. I think FK should not be deleted. If someone really
-wants to delete "old" FK then he can do
-SET CONSTRAINTS ... IMMEDIATE; -- FK need to be deleted here
-> Consider you have a deferred "ON DELETE RESTRICT" and "ON
-> UPDATE CASCADE" constraint. If you DELETE PK1 and UPDATE PK2
-> to PK1, the FK2 rows need to follow, but does PK2 inherit all
-> FK1 rows now so it's the master of both groups?
-Yes. Again one can use SET CONSTRAINTS to achieve desirable results.
-It seems that SET CONSTRAINTS was designed for these purposes - ie
-for better flexibility.
-Though, it would be better to look at how other DBs handle all these
-cases -:)
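The "key value" reading favored in the reply above can be sketched in C. This is only an illustration under stated assumptions: keys are single integers, and `pk_has_key` and `count_orphans` are hypothetical helper names, not PostgreSQL functions. A referencing row counts as orphaned only if no primary-key row with the same value exists at the moment the constraint is checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if key k is among the n primary-key
 * values visible at the time of the check. */
bool pk_has_key(const int *pk, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        if (pk[i] == k)
            return true;
    return false;
}

/* Hypothetical helper: count referencing rows whose key has no
 * matching primary-key value at the time of the check. */
size_t count_orphans(const int *pk, size_t npk, const int *fk, size_t nfk)
{
    assert(pk != NULL || npk == 0);
    assert(fk != NULL || nfk == 0);

    size_t orphans = 0;
    for (size_t i = 0; i < nfk; i++)
        if (!pk_has_key(pk, npk, fk[i]))
            orphans++;
    return orphans;
}
```

Under this reading, a DELETE followed by an INSERT of the same key leaves the key present at commit, so a deferred check finds no orphans and nothing needs to cascade. Running the same check immediately after the DELETE, as SET CONSTRAINTS ... IMMEDIATE would, sees no matching key and reports the row as orphaned.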
-From janwieck@yahoo.com Fri Jan 26 12:20:27 2001
-From: Jan Wieck <janwieck@yahoo.com>
-Message-Id: <200101261730.MAA03196@jupiter.greatbridge.com>
-Subject: Re: [HACKERS] Open 7.1 items
-To: PostgreSQL HACKERS <pgsql-hackers@postgreSQL.org>,
- Bruce Momjian <root@candle.pha.pa.us>
-Date: Fri, 26 Jan 2001 12:30:05 -0500 (EST)
-X-Mailer: ELM [version 2.4ME+ PL68 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Status: RO
-Bruce Momjian wrote:
-> Here are my open 7.1 items. Thanks for shrinking the list so far.
-> ---------------------------------------------------------------------------
-> FreeBSD locale bug
-> Reorder INSERT firing in rules
- I don't recall why this is wanted. AFAIK there's no reason
- NOT to do so, except for the actual state of being far too
- close to a release candidate.
-> Philip Warner UPDATE crash
-> JDBC LargeObject short read return value missing
-> SELECT cash_out(1) crashes all backends
-> FOREIGN KEY INSERT & UPDATE/DELETE in transaction "change violation"
- A well known issue, and I've asked multiple times how exactly
- we want to define the behaviour for deferred constraints. Do
- foreign keys reference just to a key value and are happy with
- its existence, or do they refer to a particular row?
- Consider you have a deferred "ON DELETE CASCADE" constraint
- and do a DELETE, INSERT of a PK. Do the FK rows need to be
- deleted or not?
- Consider you have a deferred "ON DELETE RESTRICT" and "ON
- UPDATE CASCADE" constraint. If you DELETE PK1 and UPDATE PK2
- to PK1, the FK2 rows need to follow, but does PK2 inherit all
- FK1 rows now so it's the master of both groups?
- These are only two possible combinations. There are many to
- think of. As said, I've asked before, but no one voted yet.
- Move the item to 7.2 anyway, because changing this behaviour
- would require massive changes in the trigger queue *and* the
- generic RI triggers, which cannot be tested enough any more.
-> Usernames limited in length
-> Does pg_dump preserve COMMENTs?
-> Failure of nested cursors in JDBC
-> JDBC setMaxRows() is global variable affecting other objects
-> Does JDBC Makefile need current dir?
-> Fix for pg_dump of bad system tables
-> Steve Howe failure query with rules
-> ODBC/JDBC not disconnecting properly?
-> Magnus Hagander ODBC issues?
-> Merge MySQL/PgSQL translation scripts
-> Fix ipcclean on Linux
-> Merge global and template BKI files?
-> --
-> Bruce Momjian | http://candle.pha.pa.us
-> pgman@candle.pha.pa.us | (610) 853-3000
-> + If your life is a hard drive, | 830 Blythe Avenue
-> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-#================================================== JanWieck@Yahoo.com #
-From pgsql-general-owner+M590@postgresql.org Tue Nov 14 16:30:40 2000
-Date: Tue, 14 Nov 2000 14:25:51 -0800 (PST)
-From: Stephan Szabo <sszabo@megazone23.bigpanda.com>
-To: "Beth K. Gatewood" <bethg@mbt.washington.edu>
-cc: pgsql-general@postgresql.org
-Subject: Re: [GENERAL] a request for some experienced input.....
-In-Reply-To: <3A11ACA1.E5D847DD@mbt.washington.edu>
-Message-ID: <Pine.BSF.4.21.0011141403380.68986-100000@megazone23.bigpanda.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-general-owner@postgresql.org
-Status: OR
-On Tue, 14 Nov 2000, Beth K. Gatewood wrote:
-> >
-> Stephan-
-> Thank you so much for taking the effort to answer these questions. Your
-> help is truly appreciated....
-> I just have a few points for clarification.
-> >
-> > MATCH PARTIAL is a specific match type which describes which rows are
-> > considered matching rows for purposes of meeting or failing the
-> > constraint. (In match partial, a fktable (NULL, 2) would match a pk
-> > table (1,2) as well as a pk table (2,2). It's different from match
-> > full in which case (NULL,2) would be invalid or match unspecified
-> > in which case it would match due to the existence of the NULL in any
-> > case). There are some bizarre implementation details involved with
-> > it and it's different from the others in ways that make it difficult.
-> > It's in my list of things to do, but I haven't come up with an acceptable
-> > mechanism in my head yet.
-> Does this mean, currently that I can not have foreign keys with null values?
-Not exactly...
-Match full = In FK row, all columns must be NULL or the value of each
- column must not be null and there is a row in the PK table where
- each referencing column equals the corresponding referenced
- column.
-Unspecified = In FK row, at least one column must be NULL or each
- referencing column shall be equal to the corresponding referenced
- column in some row of the referenced table
-Match partial is similar to match full except we ignore the null columns
- for purposes of the each referencing column equals bit.
-For example:
- PK Table Key values: (1,2), (1,3), (3,3)
- Attempted FK Table Key values: (1,2), (1,NULL), (5,NULL), (NULL, NULL)
- (hopefully I get this right)...
- In match full, only the 1st and 4th fk values are valid.
- In match partial, the 1st, 2nd, and 4th fk values are valid.
- In match unspecified, all the fk values are valid.
-The other note is that generally speaking, all three are basically the
-same for the single column key. If you're only doing references on one
-column, the match type is mostly meaningless.
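The three match types described above can be written down as a small C sketch. It assumes a two-column key, non-null primary-key values, and hypothetical names (`Key2`, `MatchType`, `fk_row_valid`) that do not come from the PostgreSQL sources; it follows the wording of this message, not any actual implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A two-column key; is_null[i] marks column i as SQL NULL. */
typedef struct {
    int  val[2];
    bool is_null[2];
} Key2;

typedef enum { MATCH_UNSPECIFIED, MATCH_FULL, MATCH_PARTIAL } MatchType;

/* True if some PK row equals fk on every non-null fk column. */
static bool matches_some_pk(const Key2 *fk, const Key2 *pk, size_t npk)
{
    for (size_t r = 0; r < npk; r++) {
        bool ok = true;
        for (int c = 0; c < 2; c++)
            if (!fk->is_null[c] && fk->val[c] != pk[r].val[c])
                ok = false;
        if (ok)
            return true;
    }
    return false;
}

/* Decide whether one referencing row satisfies the constraint. */
bool fk_row_valid(MatchType mt, const Key2 *fk, const Key2 *pk, size_t npk)
{
    assert(fk != NULL && (pk != NULL || npk == 0));
    int nulls = (fk->is_null[0] ? 1 : 0) + (fk->is_null[1] ? 1 : 0);

    switch (mt) {
    case MATCH_UNSPECIFIED:
        /* Any NULL column makes the row acceptable. */
        return nulls > 0 || matches_some_pk(fk, pk, npk);
    case MATCH_FULL:
        /* All NULL is fine, a mix of NULL and non-NULL is not. */
        if (nulls == 2)
            return true;
        if (nulls == 1)
            return false;
        return matches_some_pk(fk, pk, npk);
    case MATCH_PARTIAL:
        /* NULL columns are ignored when looking for a match. */
        return nulls == 2 || matches_some_pk(fk, pk, npk);
    }
    return false;
}
```

With the PK values (1,2), (1,3), (3,3) from the example, this sketch accepts the 1st and 4th FK values under match full, the 1st, 2nd and 4th under match partial, and all four under match unspecified, which agrees with the list above.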
-> > PENDANT adds that for each row of the referenced table the values of
-> > the specified column(s) are the same as the values of the specified
-> > column(s) in some row of the referencing tables.
-> I am not sure I know what you mean here.....Are you saying that the value for
-> the FK column must match the value for the PK column?
-I haven't really looked at PENDANT, the above was just a small rewrite of
-some descriptive text in the sql99 draft I have. There's a whole bunch
-of rules in the actual text of the referential constraint definition.
-The base stuff seems to be: (Rf is the referencing columns, T is the
-referenced table)
- 3) If PENDANT is specified, then:
- a) For a given row in the referencing table, let pendant
- reference designate an instance in which all Rf are
- non-null.
- b) Let number of pendant paths be the number of pendant
- references to the same referenced row in a referenced table
- from all referencing rows in all base tables.
- c) For every row in T, the number of pendant paths is equal to
- or greater than 1.
-So, I'd read it as every row in T must have at least one referencing row
-in some base table.
-There are some details about updates and that you can't mix PENDANT and
-> > The main issues in 7.0 are that older versions (might be fixed in
-> > 7.0.3) would fail very badly if you used alter table to rename tables that
-> > were referenced in a fk constraint and that you need to give update
-> > permission to the referenced table. For the former, 7.1 will (and 7.0.3
-> > may) give an elog(ERROR) to you rather than crashing the backend and the
-> > latter should be fixed for 7.1 (although you still need to have write
-> > perms to the referencing table for referential actions to work properly)
-> Are the steps to this outlined somewhere then?
-The permissions stuff is just a matter of using GRANT and REVOKE to set
-the permissions that a user has to a table.
+++ /dev/null
-From pgsql-hackers-owner+M908@postgresql.org Sun Nov 19 14:27:43 2000
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200011191923.OAA06790@candle.pha.pa.us>
-Subject: Re: [HACKERS] WAL fsync scheduling
-In-Reply-To: <002101c0525e$2d964480$b97a30d0@sectorbase.com> "from Vadim Mikheev
- at Nov 19, 2000 11:23:19 am"
-To: Vadim Mikheev <vmikheev@sectorbase.com>
-Date: Sun, 19 Nov 2000 14:23:06 -0500 (EST)
-CC: Tom Samplonius <tom@sdf.com>, Alfred@candle.pha.pa.us,
- Perlstein <bright@wintelcom.net>, Larry@candle.pha.pa.us,
- Rosenman <ler@lerctr.org>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-X-Mailer: ELM [version 2.4ME+ PL77 (25)]
-MIME-Version: 1.0
-Content-Transfer-Encoding: 7bit
-Content-Type: text/plain; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> > There are two parts to transaction commit. The first is writing all
-> > dirty buffers or log changes to the kernel, and second is fsync of the
-> ^^^^^^^^^^^^
-> Backend doesn't write any dirty buffer to the kernel at commit time.
-Yes, I suspected that.
-> > log file.
-> The first part is writing commit record into WAL buffers in shmem.
-> This is what XLogInsert does. After that XLogFlush is called to ensure
-> that entire commit record is on disk. XLogFlush does *both* write() and
-> fsync() (single slock is used for both writing and fsyncing) if it needs to
-> do it at all.
-Yes, I realize there are new steps in WAL.
-> > I suggest having a per-backend shared memory byte that has the following
-> > values:
-> >
-> > backend_number_doing_fsync
-> >
-> > I suggest that when each backend starts a commit, it sets its byte to
-> ^^^^^^^^^^^^^^^^^^^^^^^
-> Isn't START_COMMIT more meaningful?
-> > When it gets ready to fsync, it checks all backends.
-> ^^^^^^^^^^^^^^^^^^^^^^^^^^
-> What do you mean by this? The moment just after XLogInsert?
-Just before it calls fsync().
-> > If all are NOT_IN_COMMIT, it does fsync and continues.
-> 1st edition:
-> > If one or more are in START_LOG_WRITE, it waits until no one is in
-> > START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the
-> > lowest backend in WAIT_ON_FSYNC, marks all others with its backend
-> > number, and does fsync. It then clears all backends with its number to
-> > NOT_IN_COMMIT. Other backend will see they are not the lowest
-> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT
-> > so they can then continue, knowing their data was synced.
-> 2nd edition:
-> > I have another idea. If a backend gets to the point that it needs
-> > fsync, and there is another backend in START_LOG_WRITE, it can go to an
-> > interruptible sleep, knowing another backend will perform the fsync and
-> > wake it up. Therefore, there is no busy-wait or timed sleep.
-> >
-> > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a
-> > race condition.
-> The 2nd edition is much better. But I'm not sure we really need
-> these per-backend bytes in shmem. Why not just have some counters?
-> We can use a semaphore to wake-up all waiters at once.
-Yes, that is much better and clearer. My idea was just to say, "if no
-one is entering commit phase, do the commit. If someone else is coming,
-sleep and wait for them to do the fsync and wake me up with a signal."
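The counter-based idea summarized above can be sketched as a tiny state machine in C. Everything here is hypothetical (`CommitCounters`, `begin_commit`, `ready_to_fsync`, `finish_fsync` are illustration names, not WAL code), and the sketch deliberately leaves out the locking and the semaphore that real shared memory would need; it only shows the decision each backend makes.

```c
#include <assert.h>

/* Hypothetical shared counters; a real version would live in shared
 * memory and be protected by a lock. */
typedef struct {
    int writing;   /* backends between commit start and fsync point */
    int waiting;   /* backends asleep, waiting for someone's fsync  */
} CommitCounters;

typedef enum { ACT_FSYNC_NOW, ACT_SLEEP } CommitAction;

/* A backend enters the commit phase. */
void begin_commit(CommitCounters *c)
{
    c->writing++;
}

/* A backend has written its commit record and needs it on disk.
 * If someone else is still on the way, sleep and let them fsync. */
CommitAction ready_to_fsync(CommitCounters *c)
{
    assert(c->writing > 0);
    c->writing--;
    if (c->writing > 0) {
        c->waiting++;
        return ACT_SLEEP;
    }
    return ACT_FSYNC_NOW;
}

/* The backend that did the fsync wakes everyone who was waiting.
 * Returns how many sleepers were released. */
int finish_fsync(CommitCounters *c)
{
    int woken = c->waiting;
    c->waiting = 0;
    return woken;
}
```

A lone backend sees `writing` drop to zero and fsyncs at once, so it never sleeps. When two backends overlap, the first one to reach the fsync point sleeps and the second performs one fsync for both.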
-> > This allows a single backend not to sleep, and allows multiple backends
-> > to bunch up only when they are all about to commit.
-> >
-> > The reason backend numbers are written is so other backends entering the
-> > commit code will not interfere with the backends performing fsync.
-> A woken-up backend can check what's written/fsynced by calling XLogFlush.
-Seems that may not be needed anymore with a counter. The only issue is
-that other backends may enter commit while fsync() is happening. The
-process that did the fsync must be sure to wake up only the backends
-that were waiting for it, and not other backends that may also be
-doing fsync as a group while the first fsync was happening. I leave
-those details to people more experienced. :-)
-I am just glad people liked my idea.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
+++ /dev/null
-From owner-pgsql-hackers@hub.org Mon Mar 22 18:43:41 1999
-Message-ID: <sqxh7H_00gNtAmTJ5Q@andrew.cmu.edu>
-Date: Mon, 22 Mar 1999 18:27:15 -0500 (EST)
-From: Erik Riedel <riedel+@CMU.EDU>
-To: pgsql-hackers@postgreSQL.org
-Subject: [HACKERS] optimizer and type question
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: RO
-[last week aggregation, this week, the optimizer]
-I have a somewhat general optimizer question/problem that I would like
-to get some input on - i.e. I'd like to know what is "supposed" to
-work here and what I should be expecting. Sadly, I think the patch
-for this is more involved than my last message.
-Using my favorite table these days:
-Table = lineitem
-| Field | Type | Length|
-| l_orderkey | int4 not null | 4 |
-| l_partkey | int4 not null | 4 |
-| l_suppkey | int4 not null | 4 |
-| l_linenumber | int4 not null | 4 |
-| l_quantity | float4 not null | 4 |
-| l_extendedprice | float4 not null | 4 |
-| l_discount | float4 not null | 4 |
-| l_tax | float4 not null | 4 |
-| l_returnflag | char() not null | 1 |
-| l_linestatus | char() not null | 1 |
-| l_shipdate | date | 4 |
-| l_commitdate | date | 4 |
-| l_receiptdate | date | 4 |
-| l_shipinstruct | char() not null | 25 |
-| l_shipmode | char() not null | 10 |
-| l_comment | char() not null | 44 |
-Index: lineitem_index_
-and the query:
--- Query 1
-explain select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty,
-sum(l_extendedprice) as sum_base_price,
-sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
-sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
-avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price,
-avg(l_discount) as avg_disc, count(*) as count_order
-from lineitem
-where l_shipdate <= '1998-09-02'::date
-group by l_returnflag, l_linestatus
-order by l_returnflag, l_linestatus;
-note that I have eliminated the date calculation in my query of last
-week and manually replaced it with a constant (since this wasn't
-happening automatically - but let's not worry about that for now).
-And this is only an explain, we care about the optimizer. So we get:
-Sort (cost=34467.88 size=0 width=0)
-  -> Aggregate (cost=34467.88 size=0 width=0)
-    -> Group (cost=34467.88 size=0 width=0)
-      -> Sort (cost=34467.88 size=0 width=0)
-        -> Seq Scan on lineitem (cost=34467.88 size=200191 width=44)
-so let's think about the selectivity that is being chosen for the
-seq scan (the where l_shipdate <= '1998-09-02').
-Turns out the optimizer is choosing "33%", even though the real answer
-is somewhere in 90+% (that's how the query is designed). So, why does
-it do that?
-Turns out that selectivity in this case is determined via
-plancat::restriction_selectivity() which calls into functionOID = 103
-(intltsel) for operatorOID = 1096 (date "<=") on relation OID = 18663
-(my lineitem).
-This all follows because of the description of 1096 (date "<=") in
-pg_operator. Looking at local1_template1.bki.source near line 1754
-insert OID = 1096 ( "<=" PGUID 0 <...> date_le intltsel intltjoinsel )
-where we see that indeed, it thinks "intltsel" is the right function
-to use for "oprrest" in the case of dates.
-Question 1 - is intltsel the right thing for selectivity on dates?
-Hope someone is still with me.
-So now we're running selfuncs::intltsel() where we make a further call
-to selfuncs::gethilokey(). The job of gethilokey is to determine the
-min and max values of a particular attribute in the table, which will
-then be used with the constant in my where clause to estimate the
-selectivity. It is going to search the pg_statistic relation with
-three key values:
-Anum_pg_statistic_starelid 18663 (lineitem)
-Anum_pg_statistic_staattnum 11 (l_shipdate)
-Anum_pg_statistic_staop 1096 (date "<=")
-this finds no tuples in pg_statistic. Why is that? The only nearby
-tuple in pg_statistic is:
-starelid|staattnum|staop|stalokey |stahikey
- 18663| 11| 0|01-02-1992 |12-01-1998
-and the reason the query doesn't match anything? Because 1096 != 0.
-But why is it 0 in pg_statistic? Statistics are determined near line
-1844 in vacuum.c (assuming a 'vacuum analyze' run at some point)
- i = 0;
- values[i++] = (Datum) relid; /* 1 */
- values[i++] = (Datum) attp->attnum; /* 2 */
-====> values[i++] = (Datum) InvalidOid; /* 3 */
- fmgr_info(stats->outfunc, &out_function);
- out_string = <...min...>
- values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
- pfree(out_string);
- out_string = <...max...>
- values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
- pfree(out_string);
- stup = heap_formtuple(sd->rd_att, values, nulls);
-the "offending" line is setting the staop to InvalidOid (i.e. 0).
-Question 2 - is this right? Is the intent for 0 to serve as a
-"wildcard", or should it be inserting an entry for each operation?
-In the case of "wildcard" then gethilokey() should allow a match for
-Anum_pg_statistic_staop 0
-instead of requiring the more restrictive 1096. In the current code,
-what happens next is gethilokey() returns "not found" and intltsel()
-returns the default 1/3 which I see in the resultant query plan (size
-= 200191 is 1/3 of the number of lineitem tuples).
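The difference between the strict staop match and the "wildcard" reading can be sketched outside the backend. This is a minimal illustration only: `StatRow`, `find_stats()` and `default_or()` are hypothetical names invented here, and the array stands in for a pg_statistic scan; none of it is the actual gethilokey() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one pg_statistic row. */
typedef struct StatRow {
    unsigned relid;     /* starelid */
    int attnum;         /* staattnum */
    unsigned staop;     /* 0 (InvalidOid) is what vacuum stores in the thread */
    const char *lokey;  /* stalokey, kept as text */
    const char *hikey;  /* stahikey, kept as text */
} StatRow;

/* Return the first row for relid/attnum whose staop equals opid, or whose
 * staop is 0 when zero_is_wildcard is set; NULL if nothing matches. */
static const StatRow *find_stats(const StatRow *rows, size_t nrows,
                                 unsigned relid, int attnum, unsigned opid,
                                 int zero_is_wildcard)
{
    assert(rows != NULL || nrows == 0);
    for (size_t i = 0; i < nrows; i++) {
        const StatRow *r = &rows[i];
        if (r->relid != relid || r->attnum != attnum)
            continue;
        if (r->staop == opid || (zero_is_wildcard && r->staop == 0))
            return r;
    }
    return NULL;
}

/* Use the default 1/3 estimate when no statistics row was found. */
static double default_or(const StatRow *row, double estimate)
{
    return row == NULL ? 1.0 / 3.0 : estimate;
}
```

With the single row shown above (18663, 11, staop 0), a strict probe for 1096 finds nothing and the estimate falls back to 1/3, while the wildcard probe returns the row.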
-Question 3 - is there any inherent reason it couldn't get this right?
-The statistic is in the table 1992 to 1998, so the '1998-09-02' date
-should be 90-some% selectivity, a much better guess than 33%.
-Doesn't make a difference for this particular query, of course,
-because the seq scan must proceed anyhow, but it could easily affect
-other queries where selectivities matter (and it affects the
-modifications I am trying to test in the optimizer to be "smarter"
-about selectivities - my overall context is to understand/improve the
-behavior that the underlying storage system sees from queries like this).
-OK, so let's say we treat 0 as a "wildcard" and stop checking for
-1096. Now we let gethilokey() return the two dates from the statistic
-table. The immediate next thing that intltsel() does, near line 122
-in selfuncs.c, is call atol() on the strings from gethilokey(). And
-guess what it comes up with?
-low = 1
-high = 12
-because it calls atol() on '01-02-1992' and '12-01-1998'. This
-clearly isn't right, it should get some large integer that includes
-the year and day in the result. Then it should compare reasonably
-with my constant from the where clause and give a decent selectivity
-value. This leads to a re-visit of Question 1.
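The low = 1 and high = 12 results follow from how the standard C atol() works: it converts the leading digits and stops at the first character that cannot be part of a number. A small standalone check, where `key_as_long()` is a hypothetical wrapper named only for this illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* atol() stops at the first '-' after the leading digits, so for a key
 * stored as MM-DD-YYYY only the month survives the conversion. */
static long key_as_long(const char *key)
{
    assert(key != NULL);
    return atol(key);
}
```

Here `key_as_long("01-02-1992")` yields 1 and `key_as_long("12-01-1998")` yields 12, which are the two values reported above.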
-Question 4 - should date "<=" use a dateltsel() function instead of
-intltsel() as oprrest?
-If anyone is still with me, could you tell me if this makes sense, or
-if there is some other location where the appropriate type conversion
-could take place so that intltsel() gets something reasonable when it
-does the atol() calls?
-Could someone also give me a sense for how far out-of-whack the whole
-current selectivity-handling structure is? It seems that most of the
-operators in pg_operator actually use intltsel() and would have
-type-specific problems like that described. Or is the problem in the
-way attribute values are stored in pg_statistic by vacuum analyze? Or
-is there another layer where type conversion belongs?
-Phew. Enough typing, hope someone can follow this and address at
-least some of the questions.
-Erik Riedel
-Carnegie Mellon University
-From owner-pgsql-hackers@hub.org Mon Mar 22 20:31:11 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00802
- for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 20:31:09 -0500 (EST)
-Received: from hub.org (majordom@hub.org []) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id UAA13231 for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 20:15:20 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.2/8.9.1) with SMTP id UAA01981;
- Mon, 22 Mar 1999 20:14:04 -0500 (EST)
- (envelope-from owner-pgsql-hackers@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 20:13:32 +0000 (EST)
-Received: (from majordom@localhost)
- by hub.org (8.9.2/8.9.1) id UAA01835
- for pgsql-hackers-outgoing; Mon, 22 Mar 1999 20:13:28 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.9.2/8.9.1) with ESMTP id UAA01822
- for <pgsql-hackers@postgreSQL.org>; Mon, 22 Mar 1999 20:13:21 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss.sss.pgh.pa.us (localhost [])
- by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id UAA23294;
- Mon, 22 Mar 1999 20:12:43 -0500 (EST)
-To: Erik Riedel <riedel+@CMU.EDU>
-cc: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] optimizer and type question
-In-reply-to: Your message of Mon, 22 Mar 1999 18:27:15 -0500 (EST)
- <sqxh7H_00gNtAmTJ5Q@andrew.cmu.edu>
-Date: Mon, 22 Mar 1999 20:12:43 -0500
-Message-ID: <23292.922151563@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: ROr
-Erik Riedel <riedel+@CMU.EDU> writes:
-> [ optimizer doesn't find relevant pg_statistic entry ]
-It's clearly a bug that the selectivity code is not finding this tuple.
-If your analysis is correct, then selectivity estimation has *never*
-worked properly, or at least not in recent memory :-(. Yipes.
-Bruce and I found a bunch of other problems in the optimizer recently,
-so it doesn't faze me to assume that this is broken too.
-> the "offending" line is setting the staop to InvalidOid (i.e. 0).
-> Question 2 - is this right? Is the intent for 0 to serve as a
-> "wildcard",
-My thought is that what the staop column ought to be is the OID of the
-comparison function that was used to determine the sort order of the
-column. Without a sort op the lowest and highest keys in the column are
-not well defined, so it makes no sense to assert "these are the lowest
-and highest values" without providing the sort op that determined that.
-(For sufficiently complex data types one could reasonably have multiple
-ordering operators. A crude example is sorting on "circumference" and
-"area" for polygons.) But typically the sort op will be the "<"
-operator for the column data type.
-So, the vacuum code is definitely broken --- it's not storing the sort
-op that it used. The code in gethilokey might be broken too, depending
-on how it is producing the operator it's trying to match against the
-tuple. For example, if the actual operator in the query is any of
-< <= > >= on int4, then int4lt ought to be used to probe the pg_statistic
-table. I'm not sure if we have adequate info in pg_operator or pg_type
-to let the optimizer code determine the right thing to probe with :-(
-> The immediate next thing that intltsel() does, near line 122
-> in selfuncs.c, is call atol() on the strings from gethilokey(). And
-> guess what it comes up with?
-> low = 1
-> high = 12
-> because it calls atol() on '01-02-1992' and '12-01-1998'. This
-> clearly isn't right, it should get some large integer that includes
-> the year and day in the result. Then it should compare reasonably
-> with my constant from the where clause and give a decent selectivity
-> value. This leads to a re-visit of Question 1.
-> Question 4 - should date "<=" use a dateltsel() function instead of
-> intltsel() as oprrest?
-This is clearly busted as well. I'm not sure that creating dateltsel()
-is the right fix, however, because if you go down that path then every
-single datatype needs its own selectivity function; that's more than we
-What we really want here is to be able to map datatype values into
-some sort of numeric range so that we can compute what fraction of the
-low-key-to-high-key range is on each side of the probe value (the
-constant taken from the query). This general concept will apply to
-many scalar types, so what we want is a type-specific mapping function
-and a less-specific fraction-computing-function. Offhand I'd say that
-we want intltsel() and floatltsel(), plus conversion routines that can
-produce either int4 or float8 from a data type as seems appropriate.
-Anything that couldn't map to one or the other would have to supply its
-own selectivity function.
-> Or is the problem in the
-> way attribute values are stored in pg_statistic by vacuum analyze?
-Looks like it converts the low and high values to text and stores them
-that way. Ugly as can be :-( but I'm not sure there is a good
-alternative. We have no "wild card" column type AFAIK, which is what
-these columns of pg_statistic would have to be to allow storage of
-unconverted min and max values.
-I think you've found a can of worms here. Congratulations ;-)
- regards, tom lane
-From owner-pgsql-hackers@hub.org Mon Mar 22 23:31:00 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA03384
- for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 23:30:58 -0500 (EST)
-Received: from hub.org (majordom@hub.org []) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id XAA25586 for <maillist@candle.pha.pa.us>; Mon, 22 Mar 1999 23:18:25 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.2/8.9.1) with SMTP id XAA17955;
- Mon, 22 Mar 1999 23:17:24 -0500 (EST)
- (envelope-from owner-pgsql-hackers@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 23:16:49 +0000 (EST)
-Received: (from majordom@localhost)
- by hub.org (8.9.2/8.9.1) id XAA17764
- for pgsql-hackers-outgoing; Mon, 22 Mar 1999 23:16:46 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from po8.andrew.cmu.edu (PO8.ANDREW.CMU.EDU [])
- by hub.org (8.9.2/8.9.1) with ESMTP id XAA17745
- for <pgsql-hackers@postgreSQL.org>; Mon, 22 Mar 1999 23:16:39 -0500 (EST)
- (envelope-from er1p+@andrew.cmu.edu)
-Received: (from postman@localhost) by po8.andrew.cmu.edu (8.8.5/8.8.2) id XAA04273; Mon, 22 Mar 1999 23:16:37 -0500 (EST)
-Received: via switchmail; Mon, 22 Mar 1999 23:16:37 -0500 (EST)
-Received: from hazy.adsl.net.cmu.edu via qmail
- ID </afs/andrew.cmu.edu/service/mailqs/q000/QF.kqxlJ:S00anI00p040>;
- Mon, 22 Mar 1999 23:15:09 -0500 (EST)
-Received: from hazy.adsl.net.cmu.edu via qmail
- ID </afs/andrew.cmu.edu/usr2/er1p/.Outgoing/QF.MqxlJ3q00anI01hKE0>;
- Mon, 22 Mar 1999 23:15:00 -0500 (EST)
-Received: from mms.4.60.Jun.27.1996.03.02.53.sun4.51.EzMail.2.0.CUILIB.3.45.SNAP.NOT.LINKED.hazy.adsl.net.cmu.edu.sun4m.54
- via MS.5.6.hazy.adsl.net.cmu.edu.sun4_51;
- Mon, 22 Mar 1999 23:14:55 -0500 (EST)
-Message-ID: <4qxlJ0200anI01hK40@andrew.cmu.edu>
-Date: Mon, 22 Mar 1999 23:14:55 -0500 (EST)
-From: Erik Riedel <riedel+@CMU.EDU>
-To: Tom Lane <tgl@sss.pgh.pa.us>
-Subject: Re: [HACKERS] optimizer and type question
-Cc: pgsql-hackers@postgreSQL.org
-In-Reply-To: <23292.922151563@sss.pgh.pa.us>
-References: <23292.922151563@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: ROr
-OK, building on your high-level explanation, I am attaching a patch that
-attempts to do something "better" than the current code. Note that I
-have only tested this with the date type and my particular query. I
-haven't run it through the regression, so consider it "proof of concept"
-at best. Although hopefully it will serve my purposes.
-> My thought is that what the staop column ought to be is the OID of the
-> comparison function that was used to determine the sort order of the
-> column. Without a sort op the lowest and highest keys in the column are
-> not well defined, so it makes no sense to assert "these are the lowest
-> and highest values" without providing the sort op that determined that.
-> (For sufficiently complex data types one could reasonably have multiple
-> ordering operators. A crude example is sorting on "circumference" and
-> "area" for polygons.) But typically the sort op will be the "<"
-> operator for the column data type.
-I changed vacuum.c to do exactly that: it now stores the OID of the lt sort op.
-> So, the vacuum code is definitely broken --- it's not storing the sort
-> op that it used. The code in gethilokey might be broken too, depending
-> on how it is producing the operator it's trying to match against the
-> tuple. For example, if the actual operator in the query is any of
-> < <= > >= on int4, then int4lt ought to be used to probe the pg_statistic
-> table. I'm not sure if we have adequate info in pg_operator or pg_type
-> to let the optimizer code determine the right thing to probe with :-(
-This indeed seems like a bigger problem. I thought about somehow using
-type-matching from the sort op and the actual operator in the query - if
-both the left and right type match, then consider them the same for
-purposes of this probe. That seemed complicated, so I punted in my
-example - it just does the search with relid and attnum and assumes that
-only returns one tuple. This works in my case (maybe in all cases,
-because of the way vacuum is currently written - ?).
-> What we really want here is to be able to map datatype values into
-> some sort of numeric range so that we can compute what fraction of the
-> low-key-to-high-key range is on each side of the probe value (the
-> constant taken from the query). This general concept will apply to
-> many scalar types, so what we want is a type-specific mapping function
-> and a less-specific fraction-computing-function. Offhand I'd say that
-> we want intltsel() and floatltsel(), plus conversion routines that can
-> produce either int4 or float8 from a data type as seems appropriate.
-> Anything that couldn't map to one or the other would have to supply its
-> own selectivity function.
-This is what my example then does: it uses the stored sort op to get the
-type, then uses typinput to convert from the string to an int4, and
-finally puts the int4 back into string format because that's what
-everyone was expecting.
-It seems to work for my particular query. I now get:
-(selfuncs) gethilokey() obj 18663 attr 11 opid 1096 (ignored)
-(selfuncs) gethilokey() found op 1087 in pg_proc
-(selfuncs) gethilokey() found type 1082 in pg_type
-(selfuncs) gethilokey() going to use 1084 to convert type 1082
-(selfuncs) gethilokey() have low -2921 high -396
-(selfuncs) intltsel() high -396 low -2921 val -486
-(plancat) restriction_selectivity() for func 103 op 1096 rel 18663 attr
-11 const -486 flag 3 returns 0.964356
-Sort (cost=34467.88 size=0 width=0)
- -> Aggregate (cost=34467.88 size=0 width=0)
- -> Group (cost=34467.88 size=0 width=0)
- -> Sort (cost=34467.88 size=0 width=0)
- -> Seq Scan on lineitem (cost=34467.88 size=579166 width=44)
-including my printfs, which exist in the patch as well.
-Selectivity is now the expected 96% and the size estimate for the seq
-scan is much closer to correct.
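The 0.964356 figure is consistent with a plain linear interpolation of the probe value between the low and high keys printed above. The sketch below reproduces only that arithmetic; `range_fraction()` is a hypothetical helper, and the clamping at the ends is an assumption made here, not the behaviour of intltsel() itself.

```c
#include <assert.h>

/* Fraction of the [low, high] range that lies at or below val.
 * Values outside the range are clamped to 0 or 1 (an assumption of
 * this sketch). */
static double range_fraction(long low, long high, long val)
{
    assert(high > low);
    if (val <= low)
        return 0.0;
    if (val >= high)
        return 1.0;
    return (double) (val - low) / (double) (high - low);
}
```

With low = -2921, high = -396 and val = -486 this gives 2435 / 2525, roughly 0.964356.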
-Again, not tested with anything besides date, so caveat not-tested.
-Hope this helps.
-#! /bin/sh
-# This is a shell archive, meaning:
-# 1. Remove everything above the #! /bin/sh line.
-# 2. Save the resulting text in a file.
-# 3. Execute the file with /bin/sh (not csh) to create:
-# selfuncs.c.diff
-# vacuum.c.diff
-# This archive created: Mon Mar 22 22:58:14 1999
-export PATH; PATH=/bin:/usr/bin:$PATH
-if test -f 'selfuncs.c.diff'
- echo shar: "will not over-write existing file 'selfuncs.c.diff'"
-cat << \SHAR_EOF > 'selfuncs.c.diff'
-/selfuncs.c Thu Mar 11 23:59:35 1999
-/selfuncs.c Mon Mar 22 22:57:25 1999
-*** 32,37 ****
---- 32,40 ----
- #include "utils/lsyscache.h" /* for get_oprrest() */
- #include "catalog/pg_statistic.h"
-+ #include "catalog/pg_proc.h" /* for Form_pg_proc */
-+ #include "catalog/pg_type.h" /* for Form_pg_type */
- /* N is not a valid var/constant or relation id */
- #define NONVALUE(N) ((N) == -1)
-*** 103,110 ****
- bottom;
- result = (float64) palloc(sizeof(float64data));
-! if (NONVALUE(attno) || NONVALUE(relid))
- *result = 1.0 / 3;
- else
- {
- /* XXX val = atol(value); */
---- 106,114 ----
- bottom;
- result = (float64) palloc(sizeof(float64data));
-! if (NONVALUE(attno) || NONVALUE(relid)) {
- *result = 1.0 / 3;
-+ }
- else
- {
- /* XXX val = atol(value); */
-*** 117,130 ****
- }
- high = atol(highchar);
- low = atol(lowchar);
- if ((flag & SEL_RIGHT && val < low) ||
- (!(flag & SEL_RIGHT) && val > high))
- {
- float32data nvals;
- nvals = getattdispersion(relid, (int) attno);
-! if (nvals == 0)
- *result = 1.0 / 3.0;
- else
- {
- *result = 3.0 * (float64data) nvals;
---- 121,136 ----
- }
- high = atol(highchar);
- low = atol(lowchar);
-+ printf("(selfuncs) intltsel() high %d low %d val %d\n",high,low,val);
- if ((flag & SEL_RIGHT && val < low) ||
- (!(flag & SEL_RIGHT) && val > high))
- {
- float32data nvals;
- nvals = getattdispersion(relid, (int) attno);
-! if (nvals == 0) {
- *result = 1.0 / 3.0;
-+ }
- else
- {
- *result = 3.0 * (float64data) nvals;
-*** 336,341 ****
---- 342,353 ----
- {
- Relation rel;
- HeapScanDesc scan;
-+ /* this assumes there is only one row in the statistics table for any particular */
-+ /* relid, attnum pair - could be more complicated if staop is also used. */
-+ /* at the moment, if there are multiple rows, this code ends up picking the */
-+ /* "first" one - er1p */
-+ /* the actual "ignoring" is done in the call to heap_beginscan() below, where */
-+ /* we only mention 2 of the 3 keys in this array - er1p */
- static ScanKeyData key[3] = {
- {0, Anum_pg_statistic_starelid, F_OIDEQ, {0, 0, F_OIDEQ}},
- {0, Anum_pg_statistic_staattnum, F_INT2EQ, {0, 0, F_INT2EQ}},
-*** 344,355 ****
- bool isnull;
- HeapTuple tuple;
- rel = heap_openr(StatisticRelationName);
- key[0].sk_argument = ObjectIdGetDatum(relid);
- key[1].sk_argument = Int16GetDatum((int16) attnum);
- key[2].sk_argument = ObjectIdGetDatum(opid);
-! scan = heap_beginscan(rel, 0, SnapshotNow, 3, key);
- tuple = heap_getnext(scan, 0);
- if (!HeapTupleIsValid(tuple))
- {
---- 356,377 ----
- bool isnull;
- HeapTuple tuple;
-+ HeapTuple tup;
-+ Form_pg_proc proc;
-+ Form_pg_type typ;
-+ Oid which_op;
-+ Oid which_type;
-+ int32 low_value;
-+ int32 high_value;
- rel = heap_openr(StatisticRelationName);
- key[0].sk_argument = ObjectIdGetDatum(relid);
- key[1].sk_argument = Int16GetDatum((int16) attnum);
- key[2].sk_argument = ObjectIdGetDatum(opid);
-! printf("(selfuncs) gethilokey() obj %d attr %d opid %d (ignored)\n",
-! key[0].sk_argument,key[1].sk_argument,key[2].sk_argument);
-! scan = heap_beginscan(rel, 0, SnapshotNow, 2, key);
- tuple = heap_getnext(scan, 0);
- if (!HeapTupleIsValid(tuple))
- {
-*** 376,383 ****
---- 398,461 ----
- &isnull));
- if (isnull)
- elog(DEBUG, "gethilokey: low key is null");
- heap_endscan(scan);
- heap_close(rel);
-+ /* now we deal with type conversion issues */
-+ /* when intltsel() calls this routine (who knows what other callers might do) */
-+ /* it assumes that it can call atol() on the strings and then use integer */
-+ /* comparison from there. what we are going to do here, then, is try to use */
-+ /* the type information from Anum_pg_statistic_staop to convert the high */
-+ /* and low values - er1p */
-+ /* WARNING: this code has only been tested with the date type and has NOT */
-+ /* been regression tested. consider it "sample" code of what might be the */
-+ /* right kind of thing to do - er1p */
-+ /* get the 'op' from pg_statistic and look it up in pg_proc */
-+ which_op = heap_getattr(tuple,
-+ Anum_pg_statistic_staop,
-+ RelationGetDescr(rel),
-+ &isnull);
-+ if (InvalidOid == which_op) {
-+ /* ignore all this stuff, try conversion only if we have a valid staop */
-+ /* note that there is an accompanying change to 'vacuum analyze' that */
-+ /* gets this set to something useful. */
-+ } else {
-+ /* staop looks valid, so let's see what we can do about conversion */
-+ tup = SearchSysCacheTuple(PROOID, ObjectIdGetDatum(which_op), 0, 0, 0);
-+ if (!HeapTupleIsValid(tup)) {
-+ elog(ERROR, "selfuncs: unable to find op in pg_proc %d", which_op);
-+ }
-+ printf("(selfuncs) gethilokey() found op %d in pg_proc\n",which_op);
-+ /* use that to determine the type of stahikey and stalokey via pg_type */
-+ proc = (Form_pg_proc) GETSTRUCT(tup);
-+ which_type = proc->proargtypes[0]; /* XXX - use left and right
-separately? */
-+ tup = SearchSysCacheTuple(TYPOID, ObjectIdGetDatum(which_type), 0, 0, 0);
-+ if (!HeapTupleIsValid(tup)) {
-+ elog(ERROR, "selfuncs: unable to find type in pg_type %d", which_type);
-+ }
-+ printf("(selfuncs) gethilokey() found type %d in pg_type\n",which_type);
-+ /* and use that type to get the conversion function to int4 */
-+ typ = (Form_pg_type) GETSTRUCT(tup);
-+ printf("(selfuncs) gethilokey() going to use %d to convert type
-+ /* and convert the low and high strings */
-+ low_value = (int32) fmgr(typ->typinput, *low, -1);
-+ high_value = (int32) fmgr(typ->typinput, *high, -1);
-+ printf("(selfuncs) gethilokey() have low %d high
-+ /* now we have int4's, which we put back into strings because
-that's what out */
-+ /* callers (intltsel() at least) expect
- - er1p */
-+ pfree(*low); pfree(*high); /* let's not leak the old strings */
-+ *low = int4out(low_value);
-+ *high = int4out(high_value);
-+ /* XXX - this probably leaks the two tups we got from
-SearchSysCacheTuple() - er1p */
-+ }
- }
- float64
-if test -f 'vacuum.c.diff'
- echo shar: "will not over-write existing file 'vacuum.c.diff'"
-cat << \SHAR_EOF > 'vacuum.c.diff'
-vacuum.c Thu Mar 11 23:59:09 1999
-vacuum.c Mon Mar 22 21:23:15 1999
-*** 1842,1848 ****
- i = 0;
- values[i++] = (Datum) relid; /* 1 */
- values[i++] = (Datum) attp->attnum; /* 2 */
-! values[i++] = (Datum) InvalidOid; /* 3 */
- fmgr_info(stats->outfunc, &out_function);
- out_string = (*fmgr_faddr(&out_function)) (stats->min,
- values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
---- 1842,1848 ----
- i = 0;
- values[i++] = (Datum) relid; /* 1 */
- values[i++] = (Datum) attp->attnum; /* 2 */
-! values[i++] = (Datum) stats->f_cmplt.fn_oid; /* 3 */ /* get the
-'<' oid, instead of 'invalid' - er1p */
- fmgr_info(stats->outfunc, &out_function);
- out_string = (*fmgr_faddr(&out_function)) (stats->min,
- values[i++] = (Datum) fmgr(F_TEXTIN, out_string);
-exit 0
-# End of shell archive
-From owner-pgsql-hackers@hub.org Tue Mar 23 12:31:05 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17491
- for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:31:04 -0500 (EST)
-Received: from hub.org (majordom@hub.org []) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA08839 for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:08:14 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.2/8.9.1) with SMTP id MAA93649;
- Tue, 23 Mar 1999 12:04:57 -0500 (EST)
- (envelope-from owner-pgsql-hackers@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 23 Mar 1999 12:03:00 +0000 (EST)
-Received: (from majordom@localhost)
- by hub.org (8.9.2/8.9.1) id MAA93355
- for pgsql-hackers-outgoing; Tue, 23 Mar 1999 12:02:55 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.9.2/8.9.1) with ESMTP id MAA93336
- for <pgsql-hackers@postgreSQL.org>; Tue, 23 Mar 1999 12:02:43 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss.sss.pgh.pa.us (localhost [])
- by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24455;
- Tue, 23 Mar 1999 12:01:57 -0500 (EST)
-To: Erik Riedel <riedel+@CMU.EDU>
-cc: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] optimizer and type question
-In-reply-to: Your message of Mon, 22 Mar 1999 23:14:55 -0500 (EST)
- <4qxlJ0200anI01hK40@andrew.cmu.edu>
-Date: Tue, 23 Mar 1999 12:01:57 -0500
-Message-ID: <24453.922208517@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: RO
-Erik Riedel <riedel+@CMU.EDU> writes:
-> OK, building on your high-level explanation, I am attaching a patch that
-> attempts to do something "better" than the current code. Note that I
-> have only tested this with the date type and my particular query.
-Glad to see you working on this. I don't like the details of your
-patch too much though ;-). Here are some suggestions for making it
-1. I think just removing staop from the lookup in gethilokey is OK for
-now, though I'm dubious about Bruce's thought that we could delete that
-field entirely. As you observe, vacuum will not currently put more
-than one tuple for a column into pg_statistic, so we can just do the
-lookup with relid and attno and leave it at that. But I think we ought
-to leave the field there, with the idea that vacuum might someday
-compute more than one statistic for a data column. Fixing vacuum to
-put its sort op into the field is a good idea in the meantime.
-2. The type conversion you're doing in gethilokey is a mess; I think
-what you ought to make it do is simply the inbound conversion of the
-string from pg_statistic into the internal representation for the
-column's datatype, and return that value as a Datum. It also needs
-a cleaner success/failure return convention --- this business with
-"n" return is ridiculously type-specific. Also, the best and easiest
-way to find the type to convert to is to look up the column type in
-the info for the given relid, not search pg_proc with the staop value.
-(I'm not sure that will even work, since there are pg_proc entries
-with wildcard argument types.)
-3. The atol() calls currently found in intltsel are a type-specific
-cheat on what is conceptually a two-step process:
- * Convert the string stored in pg_statistic back to the internal
- form for the column data type.
- * Generate a numeric representation of the data value that can be
- used as an estimate of the range of values in the table.
-The second step is trivial for integers, which may obscure the fact
-that there are two steps involved, but nonetheless there are. If
-you think about applying selectivity logic to strings, say, it
-becomes clear that the second step is a necessary component of the
-process. Furthermore, the second step must also be applied to the
-probe value that's being passed into the selectivity operator.
-(The probe value is already in internal form, of course; but it is
-not necessarily in a useful numeric form.)
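The two steps can be sketched for the date case discussed in this thread. This is an illustration under stated assumptions, not backend code: `date_in()` and `date_to_scalar()` are hypothetical names, the text form is assumed to be MM-DD-YYYY as in the pg_statistic row quoted earlier, and the internal form is assumed to be a day count relative to 2000-01-01, which is what the low/high values printed earlier in the thread (-2921 and -396) suggest.

```c
#include <assert.h>
#include <stdio.h>

/* Days since 1970-01-01 for a proleptic Gregorian calendar date. */
static long days_from_civil(int y, int m, int d)
{
    y -= m <= 2;                                   /* Jan/Feb count as the previous year */
    long era = (y >= 0 ? y : y - 399) / 400;
    long yoe = y - era * 400;                      /* year of era, 0..399 */
    long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

/* Step 1: text (MM-DD-YYYY) -> internal form (days relative to 2000-01-01).
 * Returns 0 on success, -1 if the text cannot be parsed. */
static int date_in(const char *text, long *days)
{
    int m, d, y;

    assert(text != NULL && days != NULL);
    if (sscanf(text, "%d-%d-%d", &m, &d, &y) != 3)
        return -1;
    if (m < 1 || m > 12 || d < 1 || d > 31)
        return -1;
    *days = days_from_civil(y, m, d) - days_from_civil(2000, 1, 1);
    return 0;
}

/* Step 2: internal form -> a number usable for range arithmetic.
 * Trivial for dates, as it is for integers. */
static double date_to_scalar(long days)
{
    return (double) days;
}
```

Keeping the steps separate means the probe value, which is already in internal form, only needs the second one.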
-We can do the first of these steps by applying the appropriate "XXXin"
-conversion function for the column data type, as you have done. The
-interesting question is how to do the second one. A really clean
-solution would require adding a column to pg_type that points to a
-function that will do the appropriate conversion. I'd be inclined to
-make all of these functions return "double" (float8) and just have one
-top-level selectivity routine for all data types that can use
-range-based selectivity logic.
-We could probably hack something together that would not use an explicit
-conversion function for each data type, but instead would rely on
-type-specific assumptions inside the selectivity routines. We'd need many
-more selectivity routines though (at least one for each of int, float4,
-float8, and text data types) so I'm not sure we'd really save any work
-compared to doing it right.
-BTW, now that I look at this issue it's real clear that the selectivity
-entries in pg_operator are horribly broken. The intltsel/intgtsel
-selectivity routines are currently applied to 32 distinct data types:
-regression=> select distinct typname,oprleft from pg_operator, pg_type
-regression-> where pg_type.oid = oprleft
-regression-> and oprrest in (103,104);
-typname |oprleft
-_aclitem | 1034
-abstime | 702
-bool | 16
-box | 603
-bpchar | 1042
-char | 18
-cidr | 650
-circle | 718
-date | 1082
-datetime | 1184
-float4 | 700
-float8 | 701
-inet | 869
-int2 | 21
-int4 | 23
-int8 | 20
-line | 628
-lseg | 601
-macaddr | 829
-money | 790
-name | 19
-numeric | 1700
-oid | 26
-oid8 | 30
-path | 602
-point | 600
-polygon | 604
-text | 25
-time | 1083
-timespan | 1186
-timestamp| 1296
-varchar | 1043
-(32 rows)
-many of which are very obviously not compatible with integer for *any*
-purpose. It looks to me like a lot of data types were added to
-pg_operator just by copy-and-paste, without paying attention to whether
-the selectivity routines were actually correct for the data type.
-As the code stands today, the bogus entries don't matter because
-gethilokey always fails, so we always get 1/3 as the selectivity
-estimate for any comparison operator (except = and != of course).
-I had actually noticed that fact and assumed that it was supposed
-to work that way :-(. But, clearly, there is code in here that
-is *trying* to be smarter.
-As soon as we fix gethilokey so that it can succeed, we will start
-getting essentially-random selectivity estimates for those data types
-that aren't actually binary-compatible with integer. That will not do;
-we have to do something about the issue.
- regards, tom lane
-From tgl@sss.pgh.pa.us Tue Mar 23 12:31:02 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17484
- for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:31:01 -0500 (EST)
-Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us []) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA09042 for <maillist@candle.pha.pa.us>; Tue, 23 Mar 1999 12:10:55 -0500 (EST)
-Received: from sss.sss.pgh.pa.us (localhost [])
- by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24474;
- Tue, 23 Mar 1999 12:09:52 -0500 (EST)
-To: Bruce Momjian <maillist@candle.pha.pa.us>
-cc: riedel+@CMU.EDU, pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] optimizer and type question
-In-reply-to: Your message of Mon, 22 Mar 1999 21:25:45 -0500 (EST)
- <199903230225.VAA01641@candle.pha.pa.us>
-Date: Tue, 23 Mar 1999 12:09:52 -0500
-Message-ID: <24471.922208992@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: RO
-Bruce Momjian <maillist@candle.pha.pa.us> writes:
-> What we really need is some way to determine how far the requested value
-> is from the min/max values. With int, we just do (val-min)/(max-min).
-> That works, but how do we do that for types that don't support division?
-> Strings come to mind in this case.
-What I'm envisioning is that we still apply the (val-min)/(max-min)
-logic, but apply it to numeric values that are produced in a
-type-dependent way.
-For ints and floats the conversion is trivial, of course.
-For strings, the first thing that comes to mind is to return 0 for a
-null string and the value of the first byte for a non-null string.
-This would give you one-part-in-256 selectivity which is plenty good
-enough for what the selectivity code needs to do. (Actually, it's
-only that good if the strings' first bytes are pretty well spread out.
-If you have a table containing English words, for example, you might
-only get about one part in 26 this way, since the first bytes will
-probably only run from A to Z. Might be better to use the first two
-characters of the string to compute the selectivity representation.)
-In general, you can apply this logic as long as you can come up with
-some numerical approximation to the data type's sorting order. It
-doesn't have to be exact.
- regards, tom lane
-From owner-pgsql-hackers@hub.org Thu Jul 1 20:39:19 1999
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA15403
- for <maillist@candle.pha.pa.us>; Thu, 1 Jul 1999 20:39:18 -0400 (EDT)
-Received: from hub.org (hub.org [])
- by hub.org (8.9.3/8.9.3) with ESMTP id UAA45018;
- Thu, 1 Jul 1999 20:20:27 -0400 (EDT)
- (envelope-from owner-pgsql-hackers@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 01 Jul 1999 20:15:30 +0000 (EDT)
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id UAA44474
- for pgsql-hackers-outgoing; Thu, 1 Jul 1999 20:15:28 -0400 (EDT)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by hub.org (8.9.3/8.9.3) with ESMTP id UAA44058
- for <pgsql-hackers@postgreSQL.org>; Thu, 1 Jul 1999 20:12:10 -0400 (EDT)
- (envelope-from Inoue@tpf.co.jp)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id JAA00279 for <pgsql-hackers@postgreSQL.org>; Fri, 02 Jul 1999 09:11:58 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
-Subject: [HACKERS] Optimization FAQ ?
-Date: Fri, 2 Jul 1999 09:14:10 +0900
-Message-ID: <000401bec41f$ce81dcc0$2801007e@cadzone.tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-2022-jp"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4
-Importance: Normal
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: RO
-Hello all,
-I got the following result.
-Is this a FAQ?
-drop table int2t;
-create table int2t (id int2 primary key);
-explain select * from int2t where id=1;
- Seq Scan on int2t (cost=43.00 rows=2 width=2)
-explain select * from int2t where id=1::int2;
- Index Scan using int2t_pkey on int2t (cost=2.05 rows=2 width=2)
-explain select * from int2t where id='1';
- Index Scan using int2t_pkey on int2t (cost=2.05 rows=2 width=2)
-Is this the right behavior?
-Hiroshi Inoue
-From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672
- for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:45:30 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 19:39:15 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id TAA00957;
- Thu, 20 Jan 2000 19:35:19 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 19:33:34 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id TAA00581
- for pgsql-hackers-outgoing; Thu, 20 Jan 2000 19:32:37 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.9.3/8.9.3) with ESMTP id TAA98940
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:49 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id TAA25390
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 19:31:32 -0500 (EST)
-To: pgsql-hackers@postgreSQL.org
-Subject: [HACKERS] Some notes on optimizer cost estimates
-Date: Thu, 20 Jan 2000 19:31:32 -0500
-Message-ID: <25387.948414692@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-I have been spending some time measuring actual runtimes for various
-sequential-scan and index-scan query plans, and have learned that the
-current Postgres optimizer's cost estimation equations are not very
-close to reality at all.
-Presently we estimate the cost of a sequential scan as
- Nblocks + CPU_PAGE_WEIGHT * Ntuples
---- that is, the unit of cost is the time to read one disk page,
-and we have a "fudge factor" that relates CPU time per tuple to
-disk time per page. (The default CPU_PAGE_WEIGHT is 0.033, which
-is probably too high for modern hardware --- 0.01 seems like it
-might be a better default, at least for simple queries.) OK,
-it's a simplistic model, but not too unreasonable so far.
-The cost of an index scan is measured in these same terms as
- Nblocks + CPU_PAGE_WEIGHT * Ntuples +
- CPU_INDEX_PAGE_WEIGHT * Nindextuples
-Here Ntuples is the number of tuples selected by the index qual
-condition (typically, it's less than the total table size used in
-sequential-scan estimation). CPU_INDEX_PAGE_WEIGHT essentially
-estimates the cost of scanning an index tuple; by default it's 0.017 or
-half CPU_PAGE_WEIGHT. Nblocks is estimated as the index size plus an
-appropriate fraction of the main table size.
-There are two big problems with this:
-1. Since main-table tuples are visited in index order, we'll be hopping
-around from page to page in the table. The current cost estimation
-method essentially assumes that the buffer cache plus OS disk cache will
-be 100% efficient --- we will never have to read the same page of the
-main table twice in a scan, due to having discarded it between
-references. This of course is unreasonably optimistic. Worst case
-is that we'd fetch a main-table page for each selected tuple, but in
-most cases that'd be unreasonably pessimistic.
-2. The cost of a disk page fetch is estimated at 1.0 unit for both
-sequential and index scans. In reality, sequential access is *much*
-cheaper than the quasi-random accesses performed by an index scan.
-This is partly a matter of physical disk seeks, and partly a matter
-of benefitting (or not) from any read-ahead logic the OS may employ.
-As best I can measure on my hardware, the cost of a nonsequential
-disk read should be estimated at 4 to 5 times the cost of a sequential
-one --- I'm getting numbers like 2.2 msec per disk page for sequential
-scans, and as much as 11 msec per page for index scans. I don't
-know, however, if this ratio is similar enough on other platforms
-to be useful for cost estimating. We could make it a parameter like
-we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
-ever bothers to adjust those numbers in the field ...
-The other effect that needs to be modeled, and currently is not, is the
-"hit rate" of buffer cache. Presumably, this is 100% for tables smaller
-than the cache and drops off as the table size increases --- but I have
-no particular thoughts on the form of the dependency. Does anyone have
-ideas here? The problem is complicated by the fact that we don't really
-know how big the cache is; we know the number of buffers Postgres has,
-but we have no idea how big a disk cache the kernel is keeping. As near
-as I can tell, finding a hit in the kernel disk cache is not a lot more
-expensive than having the page sitting in Postgres' own buffers ---
-certainly it's much much cheaper than a disk read.
-BTW, if you want to do some measurements of your own, try turning on
-PGOPTIONS="-d 2 -te". This will dump a lot of interesting numbers
-into the postmaster log, if your platform supports getrusage().
- regards, tom lane
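The two cost formulas described in the message above can be written out as a small Python sketch. The function and parameter names are hypothetical, and the table and index sizes in the example are made-up inputs; only the formulas and the two default weights come from the message.

```python
CPU_PAGE_WEIGHT = 0.033        # default quoted in the message
CPU_INDEX_PAGE_WEIGHT = 0.017  # default quoted in the message


def seqscan_cost(nblocks, ntuples, cpu_page_weight=CPU_PAGE_WEIGHT):
    """Nblocks + CPU_PAGE_WEIGHT * Ntuples, in units of one disk page read."""
    return nblocks + cpu_page_weight * ntuples


def indexscan_cost(index_blocks, table_blocks, table_fraction,
                   ntuples, nindextuples,
                   cpu_page_weight=CPU_PAGE_WEIGHT,
                   cpu_index_page_weight=CPU_INDEX_PAGE_WEIGHT):
    """Nblocks + CPU_PAGE_WEIGHT * Ntuples + CPU_INDEX_PAGE_WEIGHT * Nindextuples.

    Nblocks is the index size plus a fraction of the main table size;
    table_fraction is supplied by the caller (an assumption of this sketch).
    """
    nblocks = index_blocks + table_fraction * table_blocks
    return (nblocks
            + cpu_page_weight * ntuples
            + cpu_index_page_weight * nindextuples)


# Hypothetical table: 1000 blocks, 100000 tuples, 300-block index,
# and a qualification that selects 1% of the rows.
seq = seqscan_cost(1000, 100000)
idx = indexscan_cost(300, 1000, 0.01, ntuples=1000, nindextuples=1000)
```

Both functions charge every page fetch at 1.0 unit, so the sketch reproduces the two weaknesses the message points out: repeated fetches of the same page and the higher price of nonsequential reads are not modeled.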
-From owner-pgsql-hackers@hub.org Thu Jan 20 20:26:33 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA06630
- for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:26:32 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id VAA35022;
- Thu, 20 Jan 2000 21:22:08 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:20:35 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id VAA34569
- for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:19:38 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from hercules.cs.ucsb.edu (hercules.cs.ucsb.edu [])
- by hub.org (8.9.3/8.9.3) with ESMTP id VAA34534
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:19:26 -0500 (EST)
- (envelope-from xun@cs.ucsb.edu)
-Received: from xp10-06.dialup.commserv.ucsb.edu (root@xp10-06.dialup.commserv.ucsb.edu [])
- by hercules.cs.ucsb.edu (8.8.6/8.8.6) with ESMTP id SAA04655
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:22 -0800 (PST)
-Received: from xp10-06.dialup.commserv.ucsb.edu (xun@localhost)
- by xp10-06.dialup.commserv.ucsb.edu (8.9.3/8.9.3) with ESMTP id SAA22377
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 18:19:40 -0800
-Message-Id: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
-To: pgsql-hackers@postgreSQL.org
-Reply-to: xun@cs.ucsb.edu
-Subject: Re. [HACKERS] Some notes on optimizer cost estimates
-Date: Thu, 20 Jan 2000 18:19:40 -0800
-From: Xun Cheng <xun@cs.ucsb.edu>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-I'm very glad you brought up this cost estimate issue.
-Recent work in database research has argued that a more
-detailed disk access cost model should be used for
-large queries, especially joins.
-Traditional cost estimation considers only the number of
-disk pages accessed. However, a more detailed model
-would consider three parameters: avg. seek, avg. latency
-and avg. page transfer. For an old disk, typical values are
-SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
-A sequential contiguous reading of a table (assuming
-1000 contiguous pages) would cost
-(SEEK+LATENCY+1000*TRANSFER=2617.8ms); while quasi-randomly
-reading 200 times with 2 contiguous pages/time would
-cost (SEEK+200*LATENCY+400*TRANSFER=2709.5ms).
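The arithmetic of this three-parameter model is easy to check with a short Python sketch. The function name is hypothetical; the constants are the ones given above. The last line is an extra assumption, not something stated in the message: it shows the cost if each of the 200 reads also paid its own seek.

```python
SEEK_MS = 9.5      # average seek time
LATENCY_MS = 8.3   # average rotational latency
TRANSFER_MS = 2.6  # average transfer time per page


def access_cost_ms(seeks, latencies, pages):
    """Total time in ms for a given number of seeks, rotational waits, and page transfers."""
    return seeks * SEEK_MS + latencies * LATENCY_MS + pages * TRANSFER_MS


# One seek, one rotational wait, then 1000 contiguous pages.
sequential = access_cost_ms(1, 1, 1000)

# 200 reads of 2 contiguous pages each, charged as in the formula above:
# one seek, 200 rotational waits, 400 page transfers.
quasi_random = access_cost_ms(1, 200, 400)

# Assumed variant: every one of the 200 reads needs its own seek.
quasi_random_with_seeks = access_cost_ms(200, 200, 400)
```

Under the formula as given, the two access patterns cost nearly the same; the gap only becomes large in the assumed variant where every read pays a seek.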
-Someone from IBM lab re-studied the traditional
-ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost model
-and found some interesting results.
->I have been spending some time measuring actual runtimes for various
->sequential-scan and index-scan query plans, and have learned that the
->current Postgres optimizer's cost estimation equations are not very
->close to reality at all.
-One interesting question I'd like to ask is whether this inaccuracy
-really affects the choices made by postgresql's query optimizer,
-and if so, to what degree. My point is that
-if the optimizer estimates the cost of a sequential scan as 10 and
-the cost of an index scan as 20 while the actual costs are 10 vs. 40,
-it should be ok because the optimizer would still choose the sequential
-scan, as it should.
->1. Since main-table tuples are visited in index order, we'll be hopping
->around from page to page in the table.
-I'm not sure about the implementation in postgresql. One thing you might
-be able to do is to first collect all must-read page addresses from
-the index scan and then order them before the actual ordered page fetching.
-It would at least avoid the same page being read twice (not entirely
-true depending on the context (like in join) and algo.)
->The current cost estimation
->method essentially assumes that the buffer cache plus OS disk cache will
->be 100% efficient --- we will never have to read the same page of the
->main table twice in a scan, due to having discarded it between
->references. This of course is unreasonably optimistic. Worst case
->is that we'd fetch a main-table page for each selected tuple, but in
->most cases that'd be unreasonably pessimistic.
-This is actually why I asked before whether postgresql
-has a raw disk facility. That would give us much more control over this
-cache issue. Of course, a raw disk facility is only worthwhile
-(besides the recoverability) if we can provide some algo. better than
-the OS cache algo. (depending on the context, like large joins).
-Actually I have another question for you guys which is somewhat related
-to this cost estimation issue. You know the difference between OLTP
-and OLAP. My question is whether you target postgresql at both kinds
-of applications or just OLTP. From what I know, OLTP and OLAP
-differ greatly in query characteristics and thus in how
-they need to be optimized. If postgresql is only targeted at
-OLTP, the above cost estimation issue might not be that
-important. However, for OLAP, large tables and large queries are
-common and optimization would be difficult.
-From owner-pgsql-hackers@hub.org Thu Jan 20 20:41:44 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07020
- for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 21:41:43 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id VAA40222;
- Thu, 20 Jan 2000 21:34:08 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:32:35 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id VAA38388
- for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:31:38 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.9.3/8.9.3) with ESMTP id VAA37422
- for <pgsql-hackers@postgreSQL.org>; Thu, 20 Jan 2000 21:31:02 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA26761;
- Thu, 20 Jan 2000 21:30:41 -0500 (EST)
-To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-cc: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] Some notes on optimizer cost estimates
-In-reply-to: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp>
-References: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp>
-Comments: In-reply-to "Hiroshi Inoue" <Inoue@tpf.co.jp>
- message dated "Fri, 21 Jan 2000 10:44:20 +0900"
-Date: Thu, 20 Jan 2000 21:30:41 -0500
-Message-ID: <26758.948421841@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: ORr
-"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
-> I've wondered why we couldn't analyze a database without vacuum.
-> We couldn't run vacuum light-heartedly because it acquires an
-> exclusive lock for the target table.
-There is probably no real good reason, except backwards compatibility,
-why the ANALYZE function (obtaining pg_statistic data) is part of
-VACUUM at all --- it could just as easily be a separate command that
-would only use read access on the database. Bruce is thinking about
-restructuring VACUUM, so maybe now is a good time to think about
-splitting out the ANALYZE code too.
-> In addition, vacuum error occurs with analyze option in most
-> cases AFAIK.
-Still, with current sources? What's the error message? I fixed
-a problem with pg_statistic tuples getting too big...
- regards, tom lane
-From tgl@sss.pgh.pa.us Thu Jan 20 21:10:28 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08412
- for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:10:26 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA27080;
- Thu, 20 Jan 2000 22:10:28 -0500 (EST)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Hiroshi Inoue <Inoue@tpf.co.jp>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Some notes on optimizer cost estimates
-In-reply-to: <200001210248.VAA07186@candle.pha.pa.us>
-References: <200001210248.VAA07186@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Thu, 20 Jan 2000 21:48:57 -0500"
-Date: Thu, 20 Jan 2000 22:10:28 -0500
-Message-ID: <27077.948424228@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: OR
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> It is nice that ANALYZE is done during vacuum. I can't imagine why you
-> would want to do an analyze without adding a vacuum to it. I guess
-> that's why I made them the same command.
-Well, the main bad thing about ANALYZE being part of VACUUM is that
-it adds to the length of time that VACUUM is holding an exclusive
-lock on the table. I think it'd make more sense for it to be a
-separate command.
-I have also been thinking about how to make ANALYZE produce a more
-reliable estimate of the most common value. The three-element list
-that it keeps now is a good low-cost hack, but it really doesn't
-produce a trustworthy answer unless the MCV is pretty darn C (since
-it will never pick up on the MCV at all until there are at least
-two occurrences in three adjacent tuples). The only idea I've come
-up with is to use a larger list, which would be slower and take
-more memory. I think that'd be OK in a separate command, but I
-hesitate to do it inside VACUUM --- VACUUM has its own considerable
-memory requirements, and there's still the issue of not holding down
-an exclusive lock longer than you have to.
- regards, tom lane
-From Inoue@tpf.co.jp Thu Jan 20 21:08:32 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08225
- for <pgman@candle.pha.pa.us>; Thu, 20 Jan 2000 22:08:29 -0500 (EST)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id MAA04148; Fri, 21 Jan 2000 12:08:30 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>
-Cc: <pgsql-hackers@postgreSQL.org>
-Subject: RE: [HACKERS] Some notes on optimizer cost estimates
-Date: Fri, 21 Jan 2000 12:14:10 +0900
-Message-ID: <001301bf63bd$95cbe680$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-In-Reply-To: <200001210248.VAA07186@candle.pha.pa.us>
-Importance: Normal
-Status: OR
-> -----Original Message-----
-> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
-> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
-> > > I've wondered why we couldn't analyze a database without vacuum.
-> > > We couldn't run vacuum light-heartedly because it acquires an
-> > > exclusive lock for the target table.
-> >
-> > There is probably no real good reason, except backwards compatibility,
-> > why the ANALYZE function (obtaining pg_statistic data) is part of
-> > VACUUM at all --- it could just as easily be a separate command that
-> > would only use read access on the database. Bruce is thinking about
-> > restructuring VACUUM, so maybe now is a good time to think about
-> > splitting out the ANALYZE code too.
-> I put it in vacuum because at the time I didn't know how to do such
-> things and vacuum already scanned the table. I just linked onto the
-> scan. Seemed like a good idea at the time.
-> It is nice that ANALYZE is done during vacuum. I can't imagine why you
-> would want to do an analyze without adding a vacuum to it. I guess
-> that's why I made them the same command.
-> If I made them separate commands, both would have to scan the table,
-> though the analyze could do it without the exclusive lock, which would
-> be good.
-The functionality of VACUUM and ANALYZE is quite different.
-I would prefer not to burden VACUUM with any more analysis work than
-it does now. That would probably mean longer locks and more aborts.
-Various kinds of analysis would be possible by splitting out ANALYZE.
-Hiroshi Inoue
-From owner-pgsql-hackers@hub.org Fri Jan 21 11:01:59 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA07821
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 12:01:57 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id LAA77357;
- Fri, 21 Jan 2000 11:52:25 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 11:50:46 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id LAA76756
- for pgsql-hackers-outgoing; Fri, 21 Jan 2000 11:49:50 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from eclipse.pacifier.com (eclipse.pacifier.com [])
- by hub.org (8.9.3/8.9.3) with ESMTP id LAA76594
- for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 11:49:01 -0500 (EST)
- (envelope-from dhogaza@pacifier.com)
-Received: from desktop (dsl-dhogaza.pacifier.net [])
- by eclipse.pacifier.com (8.9.3/8.9.3pop) with SMTP id IAA00225;
- Fri, 21 Jan 2000 08:47:26 -0800 (PST)
-Message-Id: <>
-X-Sender: dhogaza@mail.pacifier.com
-X-Mailer: Windows Eudora Pro Version 3.0.1 (32)
-Date: Fri, 21 Jan 2000 08:10:44 -0800
-To: xun@cs.ucsb.edu, pgsql-hackers@postgreSQL.org
-From: Don Baccus <dhogaza@pacifier.com>
-Subject: Re: Re. [HACKERS] Some notes on optimizer cost estimates
-In-Reply-To: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu>
-Mime-Version: 1.0
-Content-Type: text/plain; charset="us-ascii"
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-At 06:19 PM 1/20/00 -0800, Xun Cheng wrote:
->I'm very glad you brought up this cost estimate issue.
->Recent work in database research has argued that a more
->detailed disk access cost model should be used for
->large queries, especially joins.
->Traditional cost estimation considers only the number of
->disk pages accessed. However, a more detailed model
->would consider three parameters: avg. seek, avg. latency
->and avg. page transfer. For an old disk, typical values are
->SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
->A sequential contiguous reading of a table (assuming
->1000 contiguous pages) would cost
->(SEEK+LATENCY+1000*TRANSFER=2617.8ms); while quasi-randomly
->reading 200 times with 2 contiguous pages/time would
->cost (SEEK+200*LATENCY+400*TRANSFER=2709.5ms).
->Someone from IBM lab re-studied the traditional
->ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost model
->and found some interesting results.
-One complication when doing an index scan is that you are
-accessing two separate files (table and index), which can frequently
-be expected to cause a considerable increase in average seek time.
-Oracle and other commercial databases recommend spreading indices and
-tables over several spindles if at all possible in order to minimize
-this effect.
-I suspect it also helps their optimizer make decisions that are
-more consistently good for customers with the largest and most
-complex databases and queries, by making cost estimates more predictably
-Still...this doesn't help with the question about the effect of the
-filesystem cache. I wandered around the web for a little bit
-last night, and found one summary of a paper by Ousterhout on the
-effect of the Solaris cache on a fileserver serving diskless workstations.
-There was reference to the hierarchy involved (i.e. the local workstation
-cache is faster than the fileserver's cache which has to be read via
-the network which in turn is faster than reading from the fileserver's
-disk). It appears the rule-of-thumb for the cache-hit ratio on reads,
-presumably based on measuring some internal Sun systems, used in their
-calculations was 80%.
-Just a datapoint to think about.
-There's also considerable operating system theory on paging systems
-that might be useful for thinking about trying to estimate the
-Postgres cache/hit ratio. Then again, maybe Postgres could just
-keep count of how many pages of a given table are in the cache at
-any given time? Or simply keep track of the current ratio of hits
-and misses?
->>I have been spending some time measuring actual runtimes for various
->>sequential-scan and index-scan query plans, and have learned that the
->>current Postgres optimizer's cost estimation equations are not very
->>close to reality at all.
->One interesting question I'd like to ask is whether this inaccuracy
->really affects the choices made by postgresql's query optimizer,
->and if so, to what degree. My point is that
->if the optimizer estimates the cost of a sequential scan as 10 and
->the cost of an index scan as 20 while the actual costs are 10 vs. 40,
->it should be ok because the optimizer would still choose the sequential
->scan, as it should.
-This is crucial, of course - if there are only two types of scans
-available, whatever heuristic is used only has to be accurate enough
-to pick the right one. Once the choice is made, it doesn't really
-matter (from the optimizer's POV) just how long it will actually take,
-the time will be spent and presumably it will be shorter than the
-How frequently will the optimizer choose wrongly if:
-1. All of the tables and indices were in PG buffer cache or filesystem
- cache? (i.e. fixed access times for both types of scans)
-2. The table's so big that only a small fraction can reside in RAM
- during the scan and join, which means that the non-sequential
- disk access pattern of the indexed scan is much more expensive.
-Also, if you pick sequential scans more frequently based on a presumption
-that index scans are expensive due to increased average seek time, how
-often will this penalize the heavy-duty user that invests in extra
-drives and lots of RAM?
->>The current cost estimation
->>method essentially assumes that the buffer cache plus OS disk cache will
->>be 100% efficient --- we will never have to read the same page of the
->>main table twice in a scan, due to having discarded it between
->>references. This of course is unreasonably optimistic. Worst case
->>is that we'd fetch a main-table page for each selected tuple, but in
->>most cases that'd be unreasonably pessimistic.
->This is actually why I asked before whether postgresql
->has a raw disk facility. That would give us much more control over this
->cache issue. Of course, a raw disk facility is only worthwhile
->(besides the recoverability) if we can provide some algo. better than
->the OS cache algo. (depending on the context, like large joins).
-Postgres does have control over its buffer cache. The one thing that
-raw disk I/O would give you is control over where blocks are placed,
-meaning you could more accurately model the cost of retrieving them.
-So presumably the cache could be tuned to the allocation algorithm
-used to place various structures on the disk.
-I still wonder just how much gain you get by this approach, compared
-to, say, simply spending $2,000 on a gigabyte of RAM. Heck, PCs even
-support a couple gigs of RAM now.
->Actually I have another question for you guys which is somewhat related
->to this cost estimation issue. You know the difference between OLTP
->and OLAP. My question is whether you target postgresql at both kinds
->of applications or just OLTP. From what I know, OLTP and OLAP
->differ greatly in query characteristics and thus in how
->they need to be optimized. If postgresql is only targeted at
->OLTP, the above cost estimation issue might not be that
->important. However, for OLAP, large tables and large queries are
->common and optimization would be difficult.
-- Don Baccus, Portland OR <dhogaza@pacifier.com>
- Nature photos, on-line guides, Pacific Northwest
- Rare Bird Alert Service and other goodies at
- http://donb.photo.net.
-From pgsql-hackers-owner+M6019@hub.org Mon Aug 21 11:47:56 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289
- for <pgman@candle.pha.pa.us>; Mon, 21 Aug 2000 11:47:55 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383;
- Mon, 21 Aug 2000 11:47:51 -0400 (EDT)
-Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243
- for <pgsql-hackers@postgresql.org>; Mon, 21 Aug 2000 11:47:37 -0400 (EDT)
-Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000
-Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000
-Received: from eros.si.fct.unl.pt (
- by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000
-Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST)
-From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
-X-Sender: tiago@eros.si.fct.unl.pt
-To: Tom Lane <tgl@sss.pgh.pa.us>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
- constant-->index scan
-In-Reply-To: <1731.966868649@sss.pgh.pa.us>
-Message-ID: <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ORr
-On Mon, 21 Aug 2000, Tom Lane wrote:
-> > One thing it might be interesting (please tell me if you think
-> > otherwise) would be to improve pg with better statistical information, by
-> > using, for example, histograms.
-> Yes, that's been on the todo list for a while.
- If it's ok and nobody is working on that, I'll look into that subject.
- I'll start by looking at the analyze portion of vacuum. I'm thinking of
-using arrays for the histogram (I've never used the array data type of
- Should I use 7.0.2 or the cvs version?
-> Interesting article. We do most of what she talks about, but we don't
-> have anything like the ClusterRatio statistic. We need it --- that was
-> just being discussed a few days ago in another thread. Do you have any
-> reference on exactly how DB2 defines that stat?
- I don't remember seeing that information specifically. From what I've
-read I can speculate:
- 1. They have clusterratios for both indexes and the relation itself.
- 2. They might use an index even if there is no "order by" if the table
-has a low clusterratio: just to get the RIDs, then sort the RIDs and
- 3. One possible way to calculate this ratio:
- a) for tables
- SeqScan
- if tuple points to a next tuple on the same page then its
- ratio = # good tuples / # all tuples
- b) for indexes (high speculation ratio here)
- foreach pointed RID in index
- if RID is in same page as next RID in index then mark as
- I suspect that if a tuple size is big (relative to page size) then the
-cluster ratio is always low.
- A tuple might also be "good" if it pointed to the next page.
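The table case in point 3(a) can be sketched as below. This is only an illustration of the speculation in the message, not DB2's actual definition: the function name and the `allow_next_page` switch are hypothetical, tuples are assumed to be visited in key order and reduced to their page numbers, and the ratio is taken over adjacent pairs (n - 1) rather than over all n tuples, because the last tuple has no successor.

```python
def cluster_ratio(page_numbers, allow_next_page=False):
    """Fraction of tuples whose successor (in key order) is on the same page.

    page_numbers: page number of each tuple, listed in key order.
    allow_next_page: also count a tuple as "good" when its successor
    lives on the immediately following page.
    """
    if len(page_numbers) < 2:
        return 1.0  # nothing to compare; treat as perfectly clustered
    good = 0
    for current, following in zip(page_numbers, page_numbers[1:]):
        if following == current:
            good += 1
        elif allow_next_page and following == current + 1:
            good += 1
    # Only n - 1 adjacent pairs exist, so divide by that count.
    return good / (len(page_numbers) - 1)
```

With one tuple per page, `cluster_ratio([0, 1, 2, 3])` is 0.0 under the strict rule, which matches the suspicion that big tuples always give a low ratio; with `allow_next_page=True` the same input gives 1.0.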
-From pgsql-hackers-owner+M6152@hub.org Wed Aug 23 13:00:33 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259
- for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:00:33 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008;
- Wed, 23 Aug 2000 12:54:25 -0400 (EDT)
-Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749
- for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 12:49:44 -0400 (EDT)
-Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000
-Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000
-Received: from eros.si.fct.unl.pt (
- by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000
-Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST)
-From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
-X-Sender: tiago@eros.si.fct.unl.pt
-To: Tom Lane <tgl@sss.pgh.pa.us>
-cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan,
- constant-->index scan
-In-Reply-To: <27971.967041030@sss.pgh.pa.us>
-Message-ID: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ORr
-On Wed, 23 Aug 2000, Tom Lane wrote:
-> Yes, we know about that one. We have stats about the most common value
-> in a column, but no information about how the less-common values are
-> distributed. We definitely need stats about several top values not just
-> one, because this phenomenon of a badly skewed distribution is pretty
-> common.
- An end-biased histogram has stats on top values and also on the least
-frequent values. So if there is a selection on a value that is well
-below average, the selectivity estimation will be more accurate. In some
-research papers I've read, this is said to be a better approach
-than equi-width histograms (which are said to be the "industry" standard).
- I'm not sure whether to use a table or an array attribute on pg_stat for
-the histogram; the problem is what could be expected from the size of the
-attribute (being a text). I'm very afraid of the cost of going through
-several tuples on a table (pg_histogram?) during the optimization phase.
- One other idea would be to only have better statistics for special
-attributes requested by the user... something like "analyze special
-Best Regards,
-From pgsql-hackers-owner+M6160@hub.org Thu Aug 24 00:21:39 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662
- for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 00:21:38 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951;
- Thu, 24 Aug 2000 00:06:58 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775
- for <pgsql-hackers@postgresql.org>; Wed, 23 Aug 2000 23:56:57 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973;
- Wed, 23 Aug 2000 23:56:35 -0400 (EDT)
-To: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
-cc: Jules Bean <jules@jellybean.co.uk>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
-In-reply-to: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
-References: <Pine.LNX.4.21.0008231543340.4273-100000@eros.si.fct.unl.pt>
-Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
- message dated "Wed, 23 Aug 2000 16:03:42 +0100"
-Date: Wed, 23 Aug 2000 23:56:35 -0400
-Message-ID: <20970.967089395@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-=?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt> writes:
-> One other idea would be to only have better statistics for special
-> attributes requested by the user... something like "analyze special
-> table(column)".
-This might actually fall out "for free" from the cheapest way of
-implementing the stats. We've talked before about scanning btree
-indexes directly to obtain data values in sorted order, which makes
-it very easy to find the most common values. If you do that, you
-get good stats for exactly those columns that the user has created
-indexes on. A tad indirect but I bet it'd be effective...
- regards, tom lane
-From pgsql-hackers-owner+M6165@hub.org Thu Aug 24 05:33:02 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309
- for <pgman@candle.pha.pa.us>; Thu, 24 Aug 2000 05:33:01 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670;
- Thu, 24 Aug 2000 05:33:00 -0400 (EDT)
-Received: from athena.office.vi.net (office-gwb.fulham.vi.net [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216
- for <pgsql-hackers@postgresql.org>; Thu, 24 Aug 2000 05:19:03 -0400 (EDT)
-Received: from grommit.office.vi.net [] (mail)
- by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian))
- id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100
-Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian))
- id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100
-Date: Thu, 24 Aug 2000 10:11:14 +0100
-From: Jules Bean <jules@jellybean.co.uk>
-To: Tom Lane <tgl@sss.pgh.pa.us>
-Cc: Tiago Ant?o <tra@fct.unl.pt>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
-Message-ID: <20000824101113.N17510@grommit.office.vi.net>
-References: <1731.966868649@sss.pgh.pa.us> <Pine.LNX.4.21.0008211626250.25226-100000@eros.si.fct.unl.pt> <20000823133418.F17510@grommit.office.vi.net> <27971.967041030@sss.pgh.pa.us>
-Mime-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Content-Disposition: inline
-User-Agent: Mutt/1.2i
-In-Reply-To: <27971.967041030@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Wed, Aug 23, 2000 at 10:30:30AM -0400
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote:
-> Jules Bean <jules@jellybean.co.uk> writes:
-> > I have in a table a 'category' column which takes a small number of
-> > (basically fixed) values. Here by 'small', I mean ~1000, while the
-> > table itself has ~10 000 000 rows. Some categories have many, many
-> > more rows than others. In particular, there's one category which hits
-> > over half the rows. Because of this (AIUI) postgresql assumes
-> > that the query
-> > select ... from thistable where category='something'
-> > is best served by a seqscan, even though there is an index on
-> > category.
-> Yes, we know about that one. We have stats about the most common value
-> in a column, but no information about how the less-common values are
-> distributed. We definitely need stats about several top values not just
-> one, because this phenomenon of a badly skewed distribution is pretty
-> common.
-ISTM that that might be enough, in fact.
-If you have stats telling you that the most popular value is 'xyz',
-and that it constitutes 50% of the rows (i.e. 5 000 000) then you can
-conclude that, on average, other entries constitute a mere 5 000 000/999
-~~ 5000 entries each, and that would definitely be enough.
-(That's assuming you store the number of distinct values somewhere).
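The arithmetic above can be written out as a small sketch. The function name and parameters are hypothetical; the only inputs assumed are the ones the message mentions: the total row count, the fraction taken by the most common value, and the number of distinct values.

```python
def rows_per_other_value(total_rows, mcv_fraction, n_distinct):
    """Average row count for a value that is NOT the most common one.

    total_rows:   number of rows in the table
    mcv_fraction: fraction of rows holding the most common value (0..1)
    n_distinct:   number of distinct values in the column (must be > 1)
    """
    if n_distinct <= 1:
        raise ValueError("need at least two distinct values")
    remaining_rows = total_rows * (1.0 - mcv_fraction)
    # Spread the remaining rows evenly over the other distinct values.
    return remaining_rows / (n_distinct - 1)
```

For the figures in the message, `rows_per_other_value(10_000_000, 0.5, 1000)` comes to about 5005 rows, roughly 0.05% of the table. The even spread is exactly the assumption that fails when the second most common value is also very popular.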
-> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN'
-> or something like that), a fairly effective workaround is to replace the
-> dummy entries with NULL. The system does account for NULLs separately
-> from real values, so you'd then get stats based on the most common
-> non-dummy value.
-I can't really do that. Even if I could, the distribution is very
-skewed -- so the next most common makes up a very high proportion of
-what's left. I forget the figures exactly.
-From pgsql-hackers-owner+M6154@hub.org Wed Aug 23 14:36:41 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA11076
- for <pgman@candle.pha.pa.us>; Wed, 23 Aug 2000 13:36:41 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7NHTqN92431;
- Wed, 23 Aug 2000 13:29:52 -0400 (EDT)
-Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7NHM1N90883
- for <pgsql-hackers@hub.org>; Wed, 23 Aug 2000 13:22:01 -0400 (EDT)
-Received: (qmail 13816 invoked by alias); 23 Aug 2000 17:29:02 -0000
-Received: (qmail 13807 invoked from network); 23 Aug 2000 17:29:02 -0000
-Received: from eros.si.fct.unl.pt (
- by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 17:29:02 -0000
-Date: Wed, 23 Aug 2000 18:22:40 +0100 (WEST)
-From: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>
-X-Sender: tiago@eros.si.fct.unl.pt
-To: Tom Lane <tgl@sss.pgh.pa.us>
-cc: =?iso-8859-1?Q?Tiago_Ant=E3o?= <tra@fct.unl.pt>,
- PostgreSQL Hackers list <pgsql-hackers@hub.org>
-Subject: Re: [HACKERS] analyze.c
-In-Reply-To: <28154.967041988@sss.pgh.pa.us>
-Message-ID: <Pine.LNX.4.21.0008231742420.5111-100000@eros.si.fct.unl.pt>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ORr
-On Wed, 23 Aug 2000, Tom Lane wrote:
-> > What's the big reason not to do that? I know that
-> > there is some code in analyze.c (like comparing) that uses other parts of
-> > pg, but that seems to be easily fixed.
-> Are you proposing not to do any comparisons? It will be interesting to
-> see how you can compute a histogram without any idea of equality or
-> ordering. But if you want that, then you still need the function-call
-> manager as well as the type-specific comparison routines for every
-> datatype that you might be asked to operate on (don't forget
-> user-defined types here).
- I forgot user defined data types :-(, but regarding histograms I think
-the code can be made external (at least for testing purposes):
- 1. I was not suggesting not to do any comparisons, but I think the only
-comparison I need is equality, I don't need order as I don't need to
-calculate mins or maxes (I just need mins and maxes on frequencies, NOT on
-the data itself) to make a histogram.
- 2. The mapping to text guarantees that I have (PQgetvalue always
-returns char* and pg_statistics keeps a "text" anyway) a way of knowing
-about equality regardless of type.
- But at least anything relating to order has to be in.
-> > I'm leaning toward the implementation of end-biased histograms. There is
-> > an introductory reference in the IEEE Data Engineering Bulletin, september
-> > 1995 (available on microsoft research site).
-> Sounds interesting. Can you give us an exact URL?
-BTW, you can get access to SIGMOD CDs with lots of goodies for a very low
-price (at least in 1999 it was a bargain), check out ACM membership for
-I've been reading something about implementation of histograms, and,
-AFAIK, in practice a histogram is just a cool name for no more than:
- 1. top ten with frequency for each
- 2. the same for the ten worst
- 3. average for the rest
-I'm writing code to get this info (outside pg for now - for testing
-Best Regards,
-PS - again: I'm starting, so, some of my comments can be completely dumb.
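The three-part summary listed in this message (top ten, bottom ten, average for the rest) can be sketched outside the database as follows. This is a rough illustration, not the code the author mentions writing; the function names and the dictionary layout are hypothetical. It needs only equality between values (via hashing), never ordering of the data itself, in line with the point made earlier in the message.

```python
from collections import Counter


def end_biased_histogram(values, k=10):
    """Summarize value frequencies: top k, bottom k, average of the rest."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = Counter(values).most_common()  # (value, count), most frequent first
    top = ranked[:k]
    rest = ranked[k:]
    bottom = rest[-k:]   # the k least frequent values not already in "top"
    middle = rest[:-k]   # everything in between (empty if len(rest) <= k)
    if middle:
        rest_avg = sum(count for _, count in middle) / len(middle)
    else:
        rest_avg = 0.0
    return {"top": dict(top), "bottom": dict(bottom), "rest_avg": rest_avg}


def estimate_rows(hist, value):
    """Estimated row count for `value` using the histogram summary."""
    if value in hist["top"]:
        return hist["top"][value]
    if value in hist["bottom"]:
        return hist["bottom"][value]
    return hist["rest_avg"]
```

A value that is in neither end falls back to the average of the middle group, which is where the estimate is least precise.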
-From pgsql-hackers-owner+M7514@hub.org Sun Oct 15 20:38:12 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA18459
- for <pgman@candle.pha.pa.us>; Sun, 15 Oct 2000 19:38:12 -0400 (EDT)
-Received: from hub.org.org (localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e9FNaUR59496;
- Sun, 15 Oct 2000 19:36:30 -0400 (EDT)
-Received: from sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e9FNYuR58276
- for <pgsql-hackers@postgresql.org>; Sun, 15 Oct 2000 19:34:56 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.0/8.11.0) with ESMTP id e9FNXaB06046;
- Sun, 15 Oct 2000 19:33:36 -0400 (EDT)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Jules Bean <jules@jellybean.co.uk>,
- Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Performance on inserts
-In-reply-to: <200010152320.TAA17944@candle.pha.pa.us>
-References: <200010152320.TAA17944@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Sun, 15 Oct 2000 19:20:35 -0400"
-Date: Sun, 15 Oct 2000 19:33:36 -0400
-Message-ID: <6043.971652816@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ORr
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> However, assume tab2.col2 equals 3. I assume this would cause an index
-> scan because the executor doesn't know about the most common value,
-> right? Is it worth trying to improve that?
-Oh, I see: you are assuming that a nestloop join is being done, and
-wondering if it's worthwhile to switch dynamically between seqscan
-and indexscan for each scan of the inner relation, depending on exactly
-what value is being supplied from the outer relation for that scan.
-Not sure if it's worth the trouble or not. Nestloop is usually a
-last-resort join strategy anyway, and is unlikely to be picked when the
-tables are large enough to make performance be a big issue.
- regards, tom lane
-From tgl@sss.pgh.pa.us Mon Oct 16 01:48:27 2000
-Received: from sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA01602
- for <pgman@candle.pha.pa.us>; Mon, 16 Oct 2000 00:48:26 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.1/8.11.1) with ESMTP id e9G4mu521809;
- Mon, 16 Oct 2000 00:48:56 -0400 (EDT)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Jules Bean <jules@jellybean.co.uk>,
- Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Performance on inserts
-In-reply-to: <200010160441.AAA01374@candle.pha.pa.us>
-References: <200010160441.AAA01374@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Mon, 16 Oct 2000 00:41:49 -0400"
-Date: Mon, 16 Oct 2000 00:48:56 -0400
-Message-ID: <21806.971671736@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: OR
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
->> So an inner indexscan for tab1 is definitely a possible plan.
-> Yes, that was my point, that a nested loop could easily be involved if
-> the joined table has a restriction. Is there a TODO item here?
-More like a "to investigate" --- I'm not sold on the idea that a
-dynamic switch in plan types would be a win. Maybe it would be,
-One thing to think about is that it'd be critically dependent on having
-accurate statistics. Currently, the planner only places bets on the
-average behavior over a whole join. If you make a separate bet on each
-scan, then you open up the risk of betting wrong every time, should
-your stats be out-of-date or otherwise misleading.
- regards, tom lane
-From pgsql-hackers-owner+M29943@postgresql.org Thu Oct 3 18:18:27 2002
-Return-path: <pgsql-hackers-owner+M29943@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771
- for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:18:25 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id B9F51476570; Thu, 3 Oct 2002 18:18:21 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id E083B4761B0; Thu, 3 Oct 2002 18:18:19 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 13ADC476063
- for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:18:17 -0400 (EDT)
-Received: from acorn.he.net (acorn.he.net [])
- by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF
- for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:18:16 -0400 (EDT)
-Received: from CurtisVaio ([] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700
-From: "Curtis Faith" <curtis@galtair.com>
-To: "Tom Lane" <tgl@sss.pgh.pa.us>
-cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Advice: Where could I be of help?
-Date: Thu, 3 Oct 2002 18:17:55 -0400
-Message-ID: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
-In-Reply-To: <13379.1033675158@sss.pgh.pa.us>
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
-Importance: Normal
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-tom lane wrote:
-> But more globally, I think that our worst problems these days have to do
-> with planner misestimations leading to bad plans. The planner is
-> usually *capable* of generating a good plan, but all too often it picks
-> the wrong one. We need work on improving the cost modeling equations
-> to be closer to reality. If that's at all close to your sphere of
-> interest then I think it should be #1 priority --- it's both localized,
-> which I think is important for a first project, and potentially a
-> considerable win.
-This seems like a very interesting problem. One of the ways that I thought
-would be interesting and would solve the problem of trying to figure out the
-right numbers is to have certain guesses for the actual values based on
-statistics gathered during vacuum and general running and then have the
-planner run the "best" plan.
-Then during execution if the planner turned out to be VERY wrong about
-certain assumptions the execution system could update the stats that led to
-those wrong assumptions. That way the system would seek the correct values
-automatically. We could also gather the stats that the system produces for
-certain actual databases and then use those to make smarter initial guesses.
-I've found that I can never predict costs. I always end up testing
-empirically and find myself surprised at the results.
-We should be able to make the executor smart enough to keep count of actual
-costs (or a statistical approximation) without introducing any significant
-tom lane also wrote:
-> There is no "cache flushing". We have a shared buffer cache management
-> algorithm that's straight LRU across all buffers. There's been some
-> interest in smarter cache-replacement code; I believe Neil Conway is
-> messing around with an LRU-2 implementation right now. If you've got
-> better ideas we're all ears.
-Hmmm, this is the area that I think could lead to huge performance gains.
-Consider a simple system with a table tbl_master that gets read by each
-process many times but with very infrequent inserts and that contains about
-3,000 rows. The single but heavily used index for this table is contained in
-a btree with a depth of three with 20 - 8K pages in the first two levels of
-the btree.
-Another table tbl_detail with 10 indices that gets very frequent inserts.
-There are over 300,000 rows. Some queries result in index scans over the
-approximately 5,000 8K pages in the index.
-There is a 40M shared cache for this system.
-Every time a query which requires the index scan runs it will blow out the
-entire cache since the scan will load more blocks than the cache holds. Only
-blocks that are accessed while the scan is going will survive. LRU is bad,
-bad, bad!
-LRU-2 might be better but it seems like it still won't give enough priority
-to the most frequently used blocks. I don't see how it would do better for
-the above case.
-I once implemented a modified cache algorithm that was based on the clock
-algorithm for VM page caches. VM paging is similar to databases in that
-there is definite locality of reference and certain pages are MUCH more
-likely to be requested.
-The basic idea was to have a flag in each block that represented the access
-time in clock intervals. Imagine a clock hand sweeping across a clock, every
-access is like a tiny movement in the clock hand. Blocks that are not
-accessed during a sweep are candidates for removal.
-My modification was to use access counts to increase the durability of the
-more accessed blocks. Each time a block is accessed its flag is shifted
-left (up to a maximum number of shifts - ShiftN ) and 1 is added to it.
-Every so many cache accesses (and synchronously when the cache is full) a
-pass is made over each block, right shifting the flags (a clock sweep). This
-can also be done one block at a time each access so the clock is directly
-linked to the cache access rate. Any blocks with 0 are placed into a doubly
-linked list of candidates for removal. New cache blocks are allocated from
-the list of candidates. An access to a block in the candidate list just
-removes it from the list.
-An index root node page would likely be accessed frequently enough so that
-all its bits would be set, so it would take ShiftN clock sweeps to age it out.
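A minimal sketch of the scheme described above, under stated assumptions: the class and method names are hypothetical, the sweep runs only synchronously when the cache is full (the periodic or per-access sweep is omitted), and blocks are plain hashable ids rather than real buffers. It is not the author's original code.

```python
from collections import OrderedDict


class ShiftClockCache:
    """Clock-style cache where each access shifts a 1 into the block's flag."""

    def __init__(self, capacity, shift_n=4):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.mask = (1 << shift_n) - 1    # keeps at most shift_n bits per flag
        self.flags = {}                   # block id -> access flag
        self.candidates = OrderedDict()   # blocks whose flag has aged to 0

    def sweep(self):
        """One clock sweep: age every flag; zeroed blocks become candidates."""
        for block in list(self.flags):
            self.flags[block] >>= 1
            if self.flags[block] == 0:
                self.candidates[block] = None

    def access(self, block):
        """Touch a block; return True on a cache hit, False on a miss."""
        hit = block in self.flags
        if hit:
            # A touched candidate is rescued from the removal list.
            self.candidates.pop(block, None)
        else:
            while len(self.flags) >= self.capacity:
                if not self.candidates:
                    self.sweep()          # synchronous sweep when full
                    continue
                victim, _ = self.candidates.popitem(last=False)
                del self.flags[victim]
        # Shift left and add 1, capped at shift_n bits.
        self.flags[block] = ((self.flags.get(block, 0) << 1) | 1) & self.mask
        return hit
```

In this sketch a block touched `shift_n` times has all its bits set and survives `shift_n - 1` sweeps; a block touched once becomes a candidate after the first sweep. That is what lets a hot index root outlive a long sequential scan.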
-This algorithm increased the cache hit ratio from 40% to about 90% for the
-cases I tested when compared to a simple LRU mechanism. The paging ratio is
-greatly dependent on the ratio of the actual database size to the cache
-The bottom line is that it is very important to keep blocks that are frequently
-accessed in the cache. The top levels of large btrees are accessed many
-hundreds (actually a power of the number of keys in each page) of times more
-frequently than the leaf pages. LRU can be the worst possible algorithm for
-something like an index or table scan of large tables since it flushes a
-large number of potentially frequently accessed blocks in favor of ones that
-are very unlikely to be retrieved again.
-tom lane also wrote:
-> This is an interesting area. Keep in mind though that Postgres is a
-> portable DB that tries to be agnostic about what kernel and filesystem
-> it's sitting on top of --- and in any case it does not run as root, so
-> has very limited ability to affect what the kernel/filesystem do.
-> I'm not sure how much can be done without losing those portability
-> advantages.
-The kinds of things I was thinking about should be very portable. I found
-that simply writing the cache in order of the file system offset results in
-very greatly improved performance since it lets the head seek in smaller
-increments and much more smoothly, especially with modern disks. Most of the
-time the file system will lay files out as large sequential runs of bytes on
-the physical disk. It might be in a few chunks but those chunks will
-be sequential and fairly large.
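The ordering idea can be sketched as a sort over the dirty buffers before they are written. The names below are hypothetical, each buffer is assumed to be identified by a (file name, block number) pair, and `seek_distance` is only a crude stand-in for head movement within one file that assumes blocks are laid out sequentially on disk.

```python
def write_order(dirty_buffers):
    """Return dirty (file_name, block_number) pairs sorted by file, then offset."""
    return sorted(dirty_buffers, key=lambda buf: (buf[0], buf[1]))


def seek_distance(block_numbers):
    """Total distance travelled when visiting blocks in the given order."""
    return sum(abs(b - a) for a, b in zip(block_numbers, block_numbers[1:]))
```

For example, visiting blocks `[90, 5, 60, 10]` in that order covers a distance of 190 blocks, while the sorted order `[5, 10, 60, 90]` covers 85. Whether this helps in practice depends on how the file system actually lays the file out.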
-tom lane also wrote:
-> Well, not really all that isolated. The bottom-level index code doesn't
-> know whether you're doing INSERT or UPDATE, and would have no easy
-> access to the original tuple if it did know. The original theory about
-> this was that the planner could detect the situation where the index(es)
-> don't overlap the set of columns being changed by the UPDATE, which
-> would be nice since there'd be zero runtime overhead. Unfortunately
-> that breaks down if any BEFORE UPDATE triggers are fired that modify the
-> tuple being stored. So all in all it turns out to be a tad messy to fit
-> this in :-(. I am unconvinced that the impact would be huge anyway,
-> especially as of 7.3 which has a shortcut path for dead index entries.
-Well, this probably is not the right place to start then.
-- Curtis
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M29945@postgresql.org Thu Oct 3 18:47:34 2002
-Return-path: <pgsql-hackers-owner+M29945@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068
- for <pgman@candle.pha.pa.us>; Thu, 3 Oct 2002 18:47:32 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id F2AAE476306; Thu, 3 Oct 2002 18:47:27 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id E7B5247604F; Thu, 3 Oct 2002 18:47:24 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1
- for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:47:18 -0400 (EDT)
-Received: from sss.pgh.pa.us (unknown [])
- by postgresql.org (Postfix) with ESMTP id DDB0B476187
- for <pgsql-hackers@postgresql.org>; Thu, 3 Oct 2002 18:47:17 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091;
- Thu, 3 Oct 2002 18:47:18 -0400 (EDT)
-To: "Curtis Faith" <curtis@galtair.com>
-cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Advice: Where could I be of help?
-In-Reply-To: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
-References: <DMEEJMCDOJAKPPFACMPMGEBNCEAA.curtis@galtair.com>
-Comments: In-reply-to "Curtis Faith" <curtis@galtair.com>
- message dated "Thu, 03 Oct 2002 18:17:55 -0400"
-Date: Thu, 03 Oct 2002 18:47:17 -0400
-Message-ID: <15090.1033685237@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-"Curtis Faith" <curtis@galtair.com> writes:
-> Then during execution if the planner turned out to be VERY wrong about
-> certain assumptions the execution system could update the stats that led to
-> those wrong assumptions. That way the system would seek the correct values
-> automatically.
-That has been suggested before, but I'm unsure how to make it work.
-There are a lot of parameters involved in any planning decision and it's
-not obvious which ones to tweak, or in which direction, if the plan
-turns out to be bad. But if you can come up with some ideas, go to
-> Every time a query which requires the index scan runs it will blow out the
-> entire cache since the scan will load more blocks than the cache
-> holds.
-Right, that's the scenario that kills simple LRU ...
-> LRU-2 might be better but it seems like it still won't give enough priority
-> to the most frequently used blocks.
-Blocks touched more than once per query (like the upper-level index
-blocks) will survive under LRU-2. Blocks touched once per query won't.
-Seems to me that it should be a win.
-> My modification was to use access counts to increase the durability of the
-> more accessed blocks.
-You could do it that way too, but I'm unsure whether the extra
-complexity will buy anything. Ultimately, I think an LRU-anything
-algorithm is equivalent to a clock sweep for those pages that only get
-touched once per some-long-interval: the single-touch guys get recycled
-in order of last use, which seems just like a clock sweep around the
-cache. The guys with some amount of preference get excluded from the
-once-around sweep. To determine whether LRU-2 is better or worse than
-some other preference algorithm requires a finer grain of analysis than
-this. I'm not a fan of "more complex must be better", so I'd want to see
-why it's better before buying into it ...
-> The kinds of things I was thinking about should be very portable. I found
-> that simply writing the cache in order of the file system offset results in
-> very greatly improved performance since it lets the head seek in smaller
-> increments and much more smoothly, especially with modern disks.
-Shouldn't the OS be responsible for scheduling those writes
-appropriately? Ye good olde elevator algorithm ought to handle this;
-and it's at least one layer closer to the actual disk layout than we
-are, thus more likely to issue the writes in a good order. It's worth
-experimenting with, perhaps, but I'm pretty dubious about it.
-BTW, one other thing that Vadim kept saying we should do is alter the
-cache management strategy to retain dirty blocks in memory (ie, give
-some amount of preference to as-yet-unwritten dirty pages compared to
-clean pages). There is no reliability cost here since the WAL will let
-us reconstruct any dirty pages if we crash before they get written; and
-the periodic checkpoints will ensure that we eventually write a dirty
-block and thus it will become available for recycling. This seems like
-a promising line of thought that's orthogonal to the basic
-LRU-vs-whatever issue. Nobody's got round to looking at it yet though.
-I've got no idea how much preference should be given to a dirty block
---- not infinite, probably, but some.
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M29974@postgresql.org Fri Oct 4 01:28:54 2002
-Return-path: <pgsql-hackers-owner+M29974@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476
- for <pgman@candle.pha.pa.us>; Fri, 4 Oct 2002 01:28:52 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id 63999476BB2; Fri, 4 Oct 2002 01:26:56 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id BB7CA476B85; Fri, 4 Oct 2002 01:26:54 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 5FD7E476759
- for <pgsql-hackers@postgresql.org>; Fri, 4 Oct 2002 01:26:52 -0400 (EDT)
-Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [])
- by postgresql.org (Postfix) with ESMTP id 1F4A14766D8
- for <pgsql-hackers@postgresql.org>; Fri, 4 Oct 2002 01:26:51 -0400 (EDT)
-Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([] helo=CurtisVaio)
- by mclean.mail.mindspring.net with smtp (Exim 3.33 #1)
- id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400
-From: "Curtis Faith" <curtis@galtair.com>
-To: "Tom Lane" <tgl@sss.pgh.pa.us>
-cc: "Pgsql-Hackers" <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Advice: Where could I be of help?
-Date: Fri, 4 Oct 2002 01:26:36 -0400
-Message-ID: <DMEEJMCDOJAKPPFACMPMIECECEAA.curtis@galtair.com>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
-In-Reply-To: <15090.1033685237@sss.pgh.pa.us>
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
-Importance: Normal
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I wrote:
-> > My modification was to use access counts to increase the
-> durability of the
-> > more accessed blocks.
-tom lane replies:
-> You could do it that way too, but I'm unsure whether the extra
-> complexity will buy anything. Ultimately, I think an LRU-anything
-> algorithm is equivalent to a clock sweep for those pages that only get
-> touched once per some-long-interval: the single-touch guys get recycled
-> in order of last use, which seems just like a clock sweep around the
-> cache. The guys with some amount of preference get excluded from the
-> once-around sweep. To determine whether LRU-2 is better or worse than
-> some other preference algorithm requires a finer grain of analysis than
-> this. I'm not a fan of "more complex must be better", so I'd want to see
-> why it's better before buying into it ...
-I'm definitely not a fan of "more complex must be better" either. In fact,
-it's surprising how often the real performance problems are simple and easy
-to fix, while many person-years are spent solving the issue everyone
-"knows" must be causing the performance problems, only to find little gain.
-The key here is empirical testing. If the cache hit ratio for LRU-2 is
-much better then there may be no need here. OTOH, it took less than
-30 lines or so of code to do what I described, so I don't consider
-it too, too "more complex" :=} We should run a test which includes
-running indexes (or is indices the PostgreSQL convention?) that are three
-or more times the size of the cache to see how well LRU-2 works. Is there
-any cache performance reporting built into pgsql?
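To make the comparison above concrete, here is a minimal sketch of a clock sweep that uses per-slot access counts to protect frequently touched blocks. It is an illustration only, not the buffer manager code under discussion: the class name `ClockCache`, the `max_usage` cap, and the hit/miss counters are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClockCache:
    """Toy clock-sweep cache; every name here is illustrative."""
    capacity: int
    max_usage: int = 5                                   # assumed cap on the access count
    slots: List[int] = field(default_factory=list)       # block id held by each slot
    usage: List[int] = field(default_factory=list)       # access count of each slot
    where: Dict[int, int] = field(default_factory=dict)  # block id -> slot index
    hand: int = 0                                        # position of the clock hand
    hits: int = 0
    misses: int = 0

    def access(self, block: int) -> bool:
        """Touch a block and return True on a cache hit."""
        slot = self.where.get(block)
        if slot is not None:
            self.hits += 1
            self.usage[slot] = min(self.usage[slot] + 1, self.max_usage)
            return True
        self.misses += 1
        if len(self.slots) < self.capacity:
            # Free slot available: no eviction needed.
            self.slots.append(block)
            self.usage.append(1)
            self.where[block] = len(self.slots) - 1
            return False
        # Sweep: decrement counts until the hand finds a slot at zero.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.usage[slot] == 0:
                del self.where[self.slots[slot]]
                self.slots[slot] = block
                self.usage[slot] = 1
                self.where[block] = slot
                return False
            self.usage[slot] -= 1
```

Blocks touched once are recycled in roughly sweep order, while a block with a higher count survives several passes of the hand. Counting `hits` and `misses` gives the hit ratio needed for the empirical comparison suggested above.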
-tom lane wrote:
-> Shouldn't the OS be responsible for scheduling those writes
-> appropriately? Ye good olde elevator algorithm ought to handle this;
-> and it's at least one layer closer to the actual disk layout than we
-> are, thus more likely to issue the writes in a good order. It's worth
-> experimenting with, perhaps, but I'm pretty dubious about it.
-I wasn't proposing anything other than changing the order of the writes,
-not actually ensuring that they get written that way at the level you
-describe above. This will help a lot on brain-dead file systems that
-can't do this ordering and probably also in cases where the number
-of blocks in the cache is very large.
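The reordering proposed above amounts to sorting the dirty buffers before issuing the writes. A minimal sketch, assuming each buffer can be identified by a file name and a block number (the `DirtyBuffer` type and its field names are hypothetical, not taken from the backend):

```python
from typing import Iterable, List, NamedTuple


class DirtyBuffer(NamedTuple):
    relfile: str   # hypothetical name of the file backing the relation
    block: int     # block number within that file
    data: bytes    # page contents to be written


def write_order(dirty: Iterable[DirtyBuffer]) -> List[DirtyBuffer]:
    """Return the dirty buffers sorted by file, then by block number."""
    return sorted(dirty, key=lambda b: (b.relfile, b.block))
```

This only changes the order in which writes are handed to the OS; it does not guarantee the order in which they reach the disk.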
-On a related note, while looking at the code, it seems to me that we
-are writing out the buffer cache synchronously, so there won't be
-any possibility of the file system reordering anyway. This appears to be
-a huge performance problem. I've read claims in the archives that
-the buffers are written asynchronously but my read of the
-code says otherwise. Can someone point out my error?
-I only see calls that ultimately call FileWrite or write(2) which will
-block without an O_NONBLOCK open. I thought one of the main reasons
-for having a WAL is so that you can write out the buffers asynchronously.
-What am I missing?
-I wrote:
-> > Then during execution if the planner turned out to be VERY wrong about
-> > certain assumptions the execution system could update the stats
-> that led to
-> > those wrong assumptions. That way the system would seek the
-> correct values
-> > automatically.
-tom lane replied:
-> That has been suggested before, but I'm unsure how to make it work.
-> There are a lot of parameters involved in any planning decision and it's
-> not obvious which ones to tweak, or in which direction, if the plan
-> turns out to be bad. But if you can come up with some ideas, go to
-> it!
-I'll have to look at the current planner before I can suggest
-anything concrete.
-- Curtis
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+++ /dev/null
-From owner-pgsql-hackers@hub.org Mon May 11 11:31:09 1998
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03006
- for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:31:07 -0400 (EDT)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$ Revision: 1.17 $) with ESMTP id LAA01663 for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:24:42 -0400 (EDT)
-Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA21841; Mon, 11 May 1998 11:15:25 -0400 (EDT)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:15:12 +0000 (EDT)
-Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA21683 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:15:09 -0400 (EDT)
-Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us []) by hub.org (8.8.8/8.7.5) with ESMTP id LAA21451 for <hackers@postgreSQL.org>; Mon, 11 May 1998 11:15:03 -0400 (EDT)
-Received: from sss.sss.pgh.pa.us (localhost [])
- by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA24915;
- Mon, 11 May 1998 11:14:43 -0400 (EDT)
-To: Brett McCormick <brett@work.chicken.org>
-cc: hackers@postgreSQL.org
-Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
-In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT)
- <13655.4384.345723.466046@abraxas.scene.com>
-Date: Mon, 11 May 1998 11:14:43 -0400
-Message-ID: <24913.894899683@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@hub.org
-Precedence: bulk
-Status: RO
-Brett McCormick <brett@work.chicken.org> writes:
-> same way that the current network socket is passed -- through an execv
-> argument. hopefully, however, the non-execv()ing fork will be in 6.4.
-Um, you missed the point, Brett. David was hoping to transfer a client
-connection from the postmaster to an *already existing* backend process.
-Fork, with or without exec, solves the problem for a backend that's
-started after the postmaster has accepted the client socket.
-This does lead to a different line of thought, however. Pre-started
-backends would have access to the "master" connection socket on which
-the postmaster listens for client connections, right? Suppose that we
-fire the postmaster as postmaster, and demote it to being simply a
-manufacturer of new backend processes as old ones get used up. Have
-one of the idle backend processes be the one doing the accept() on the
-master socket. Once it has a client connection, it performs the
-authentication handshake and then starts serving the client (or just
-quits if authentication fails). Meanwhile the next idle backend process
-has executed accept() on the master socket and is waiting for the next
-client; and shortly the postmaster/factory/whateverwecallitnow notices
-that it needs to start another backend to add to the idle-backend pool.
-This'd probably need some interlocking among the backends. I have no
-idea whether it'd be safe to have all the idle backends trying to
-do accept() on the master socket simultaneously, but it sounds risky.
-Better to use a mutex so that only one gets to do it while the others
- regards, tom lane
-From owner-pgsql-hackers@hub.org Mon May 11 11:35:55 1998
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03043
- for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:35:53 -0400 (EDT)
-Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA23494; Mon, 11 May 1998 11:27:10 -0400 (EDT)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:27:02 +0000 (EDT)
-Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA23473 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:27:01 -0400 (EDT)
-Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us []) by hub.org (8.8.8/8.7.5) with ESMTP id LAA23462 for <hackers@postgreSQL.org>; Mon, 11 May 1998 11:26:56 -0400 (EDT)
-Received: from sss.sss.pgh.pa.us (localhost [])
- by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA25006;
- Mon, 11 May 1998 11:26:44 -0400 (EDT)
-To: Brett McCormick <brett@work.chicken.org>
-cc: hackers@postgreSQL.org
-Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
-In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT)
- <13655.4384.345723.466046@abraxas.scene.com>
-Date: Mon, 11 May 1998 11:26:44 -0400
-Message-ID: <25004.894900404@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Sender: owner-pgsql-hackers@hub.org
-Precedence: bulk
-Status: RO
-Meanwhile, *I* missed the point about Brett's second comment :-(
-Brett McCormick <brett@work.chicken.org> writes:
-> There will have to be some sort of arg parsing in any case,
-> considering that you can pass configurable arguments to the backend..
-If we do the sort of change David and I were just discussing, then the
-pre-spawned backend would become responsible for parsing and dealing
-with the PGOPTIONS portion of the client's connection request message.
-That's just part of shifting the authentication handshake code from
-postmaster to backend, so it shouldn't be too hard.
-BUT: the whole point is to be able to initialize the backend before it
-is connected to a client. How much of the expensive backend startup
-work depends on having the client connection options available?
-Any work that needs to know the options will have to wait until after
-the client connects. If that means most of the startup work can't
-happen in advance anyway, then we're out of luck; a pre-started backend
-won't save enough time to be worth the effort. (Unless we are willing
-to eliminate or redefine the troublesome options...)
- regards, tom lane
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+From owner-pgsql-hackers@hub.org Mon May 11 11:31:09 1998
+Received: from renoir.op.net (root@renoir.op.net [])
+ by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03006
+ for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:31:07 -0400 (EDT)
+Received: from hub.org (hub.org []) by renoir.op.net (o1/$ Revision: 1.17 $) with ESMTP id LAA01663 for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:24:42 -0400 (EDT)
+Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA21841; Mon, 11 May 1998 11:15:25 -0400 (EDT)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:15:12 +0000 (EDT)
+Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA21683 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:15:09 -0400 (EDT)
+Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us []) by hub.org (8.8.8/8.7.5) with ESMTP id LAA21451 for <hackers@postgreSQL.org>; Mon, 11 May 1998 11:15:03 -0400 (EDT)
+Received: from sss.sss.pgh.pa.us (localhost [])
+ by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA24915;
+ Mon, 11 May 1998 11:14:43 -0400 (EDT)
+To: Brett McCormick <brett@work.chicken.org>
+cc: hackers@postgreSQL.org
+Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
+In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT)
+ <13655.4384.345723.466046@abraxas.scene.com>
+Date: Mon, 11 May 1998 11:14:43 -0400
+Message-ID: <24913.894899683@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Sender: owner-pgsql-hackers@hub.org
+Precedence: bulk
+Status: RO
+Brett McCormick <brett@work.chicken.org> writes:
+> same way that the current network socket is passed -- through an execv
+> argument. hopefully, however, the non-execv()ing fork will be in 6.4.
+Um, you missed the point, Brett. David was hoping to transfer a client
+connection from the postmaster to an *already existing* backend process.
+Fork, with or without exec, solves the problem for a backend that's
+started after the postmaster has accepted the client socket.
+This does lead to a different line of thought, however. Pre-started
+backends would have access to the "master" connection socket on which
+the postmaster listens for client connections, right? Suppose that we
+fire the postmaster as postmaster, and demote it to being simply a
+manufacturer of new backend processes as old ones get used up. Have
+one of the idle backend processes be the one doing the accept() on the
+master socket. Once it has a client connection, it performs the
+authentication handshake and then starts serving the client (or just
+quits if authentication fails). Meanwhile the next idle backend process
+has executed accept() on the master socket and is waiting for the next
+client; and shortly the postmaster/factory/whateverwecallitnow notices
+that it needs to start another backend to add to the idle-backend pool.
+This'd probably need some interlocking among the backends. I have no
+idea whether it'd be safe to have all the idle backends trying to
+do accept() on the master socket simultaneously, but it sounds risky.
+Better to use a mutex so that only one gets to do it while the others
+ regards, tom lane
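The interlocking idea in the message above can be sketched as follows. This is a simplified illustration, not postmaster code: threads stand in for pre-started backend processes, a `threading.Lock` stands in for a cross-process mutex, and the function name `serve_pool` and the reply format are assumptions.

```python
import socket
import threading
from typing import List


def serve_pool(listener: socket.socket, workers: int,
               handled: List[int]) -> List[threading.Thread]:
    """Start idle workers that take turns calling accept() under a mutex."""
    accept_mutex = threading.Lock()

    def backend(worker_id: int) -> None:
        # Only one idle worker waits in accept() at any time.
        with accept_mutex:
            conn, _ = listener.accept()
        # Mutex released: the next idle worker can now wait for a client
        # while this one serves the connection it just received.
        with conn:
            request = conn.recv(1024)
            conn.sendall(b"served by %d: " % worker_id + request)
        handled.append(worker_id)

    threads = [threading.Thread(target=backend, args=(i,))
               for i in range(workers)]
    for t in threads:
        t.start()
    return threads
```

In this sketch each worker serves a single client and exits; a factory process that notices the shrinking pool and starts replacements, as described above, is left out.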
+From owner-pgsql-hackers@hub.org Mon May 11 11:35:55 1998
+Received: from hub.org (hub.org [])
+ by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03043
+ for <maillist@candle.pha.pa.us>; Mon, 11 May 1998 11:35:53 -0400 (EDT)
+Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA23494; Mon, 11 May 1998 11:27:10 -0400 (EDT)
+Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:27:02 +0000 (EDT)
+Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA23473 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:27:01 -0400 (EDT)
+Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us []) by hub.org (8.8.8/8.7.5) with ESMTP id LAA23462 for <hackers@postgreSQL.org>; Mon, 11 May 1998 11:26:56 -0400 (EDT)
+Received: from sss.sss.pgh.pa.us (localhost [])
+ by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA25006;
+ Mon, 11 May 1998 11:26:44 -0400 (EDT)
+To: Brett McCormick <brett@work.chicken.org>
+cc: hackers@postgreSQL.org
+Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh]
+In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT)
+ <13655.4384.345723.466046@abraxas.scene.com>
+Date: Mon, 11 May 1998 11:26:44 -0400
+Message-ID: <25004.894900404@sss.pgh.pa.us>
+From: Tom Lane <tgl@sss.pgh.pa.us>
+Sender: owner-pgsql-hackers@hub.org
+Precedence: bulk
+Status: RO
+Meanwhile, *I* missed the point about Brett's second comment :-(
+Brett McCormick <brett@work.chicken.org> writes:
+> There will have to be some sort of arg parsing in any case,
+> considering that you can pass configurable arguments to the backend..
+If we do the sort of change David and I were just discussing, then the
+pre-spawned backend would become responsible for parsing and dealing
+with the PGOPTIONS portion of the client's connection request message.
+That's just part of shifting the authentication handshake code from
+postmaster to backend, so it shouldn't be too hard.
+BUT: the whole point is to be able to initialize the backend before it
+is connected to a client. How much of the expensive backend startup
+work depends on having the client connection options available?
+Any work that needs to know the options will have to wait until after
+the client connects. If that means most of the startup work can't
+happen in advance anyway, then we're out of luck; a pre-started backend
+won't save enough time to be worth the effort. (Unless we are willing
+to eliminate or redefine the troublesome options...)
+ regards, tom lane
+++ /dev/null
-From owner-pgsql-hackers@hub.org Wed Nov 18 14:40:49 1998
-Received: from hub.org (majordom@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA29743
- for <maillist@candle.pha.pa.us>; Wed, 18 Nov 1998 14:40:36 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.1/8.9.1) with SMTP id OAA03716;
- Wed, 18 Nov 1998 14:37:04 -0500 (EST)
- (envelope-from owner-pgsql-hackers@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 18 Nov 1998 14:34:39 +0000 (EST)
-Received: (from majordom@localhost)
- by hub.org (8.9.1/8.9.1) id OAA03395
- for pgsql-hackers-outgoing; Wed, 18 Nov 1998 14:34:37 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [])
- by hub.org (8.9.1/8.9.1) with SMTP id OAA03381
- for <pgsql-hackers@hub.org>; Wed, 18 Nov 1998 14:34:31 -0500 (EST)
- (envelope-from wieck@sapserv.debis.de)
-Received: by orion.SAPserv.Hamburg.dsh.de
- for pgsql-hackers@hub.org
- id m0zgDnj-000EBTC; Wed, 18 Nov 98 21:02 MET
-Message-Id: <m0zgDnj-000EBTC@orion.SAPserv.Hamburg.dsh.de>
-From: jwieck@debis.com (Jan Wieck)
-To: meskes@usa.net (Michael Meskes)
-Date: Wed, 18 Nov 1998 21:02:06 +0100 (MET)
-Cc: pgsql-hackers@hub.org
-Reply-To: jwieck@debis.com (Jan Wieck)
-In-Reply-To: <19981118084843.B869@usa.net> from "Michael Meskes" at Nov 18, 98 08:48:43 am
-X-Mailer: ELM [version 2.4 PL25]
-Content-Type: text
-Sender: owner-pgsql-hackers@postgreSQL.org
-Precedence: bulk
-Status: RO
-Michael Meskes wrote:
-> On Wed, Nov 18, 1998 at 03:23:30AM +0000, Thomas G. Lockhart wrote:
-> > > I didn't get this one completely. What input do you mean?
-> >
-> > Just the original string/query to be prepared...
-> I see. But wouldn't it be more useful to preprocess the query and store the
-> resulting nodes instead? We don't want to parse the statement every time a
-> variable binding comes in.
- Right. A real improvement would only be to have the prepared
- execution plan in the backend and just giving the parameter
- values.
- I can think of the following construct:
- PREPARE optimizable-statement;
- That one will run parser/rewrite/planner, create a new memory
- context with a unique identifier and save the querytrees
- and plans in it. Parameter values are identified by the
- usual $n notation. The command returns the identifier.
- EXECUTE QUERY identifier [value [, ...]];
- then gets back the prepared plan and querytree by the id,
- creates an executor context with the given values in the
- parameter array and calls ExecutorRun() for them.
- The PREPARE needs to analyze the resulting parsetrees to get
- the datatypes (and maybe atttypmods) of the parameters, so
- EXECUTE QUERY can convert the values into Datums using the
- types' input functions. And the EXECUTE has to be handled
- specially in tcop (it's something between a regular query and
- a utility statement). But it's not too hard to implement.
- Finally a
- FORGET QUERY identifier;
- (don't remember how the others named it) will remove the
- prepared plan etc. simply by destroying the memory context
- and dropping the identifier from the id->mcontext+prepareinfo
- mapping.
- This all restricts the usage of PREPARE to optimizable
- statements. Is it required to be able to prepare utility
- statements (like CREATE TABLE or so) too?
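The PREPARE / EXECUTE QUERY / FORGET QUERY life cycle proposed above can be modeled as a mapping from identifier to stored plan. The sketch below is a toy under stated assumptions: the class `PreparedPlans` is hypothetical, finding the highest `$n` stands in for the parser/rewrite/planner step, and textual substitution stands in for running the stored plan with a parameter array.

```python
import itertools
import re
from typing import Any, Dict, Sequence, Tuple


class PreparedPlans:
    """Toy registry: identifier -> (query text, number of $n parameters)."""

    def __init__(self) -> None:
        self._next_id = itertools.count(1)
        self._plans: Dict[int, Tuple[str, int]] = {}

    def prepare(self, query: str) -> int:
        # Stand-in for parse/rewrite/plan: just find the highest $n used.
        params = [int(n) for n in re.findall(r"\$(\d+)", query)]
        ident = next(self._next_id)
        self._plans[ident] = (query, max(params, default=0))
        return ident

    def execute(self, ident: int, values: Sequence[Any]) -> str:
        query, nparams = self._plans[ident]
        if len(values) != nparams:
            raise ValueError(f"expected {nparams} values, got {len(values)}")
        # Stand-in for executing the plan: bind values and return the text.
        return re.sub(r"\$(\d+)",
                      lambda m: repr(values[int(m.group(1)) - 1]), query)

    def forget(self, ident: int) -> None:
        # Dropping the entry plays the role of destroying the memory context.
        del self._plans[ident]
```

The point of the design is that `prepare` pays the planning cost once, `execute` only supplies values, and `forget` releases everything tied to the identifier in one step.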
-# It's easier to get forgiveness for being wrong than for being right. #
-# Let's break this rule - forgive me. #
-#======================================== jwieck@debis.com (Jan Wieck) #
-From pgsql-hackers-owner+M67@postgresql.org Tue Oct 31 19:18:16 2000
-Received: from mail.postgresql.org ([])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA08916
- for <pgman@candle.pha.pa.us>; Tue, 31 Oct 2000 19:18:15 -0500 (EST)
-Received: from mail.postgresql.org ([])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA10IOl60635;
- Tue, 31 Oct 2000 19:18:24 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M67@postgresql.org)
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA10H8l60400
- for <pgsql-hackers@postgresql.org>; Tue, 31 Oct 2000 19:17:08 -0500 (EST)
- (envelope-from zakkr@zf.jcu.cz)
-Received: from localhost (zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id BAA32036;
- Wed, 1 Nov 2000 01:16:42 +0100
-Date: Wed, 1 Nov 2000 01:16:42 +0100 (CET)
-From: Karel Zak <zakkr@zf.jcu.cz>
-To: Alfred Perlstein <bright@wintelcom.net>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Query cache import?
-In-Reply-To: <20001031151144.F22110@fw.wintelcom.net>
-Message-ID: <Pine.LNX.3.96.1001101005110.31713B-100000@ara.zf.jcu.cz>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Tue, 31 Oct 2000, Alfred Perlstein wrote:
-> I never saw much traffic regarding Karel's work on making stored
-> procedures:
-> http://people.freebsd.org/~alfred/karel-pgsql.txt
-> What happened with this? It looked pretty interesting. :(
- It's probably a little about me :-) ... well,
- My query cache is in a usable state and it's efficient for all the
-things that motivated me to work on this.
- some basic features:
- - share parsed plans between backends in shared memory
- - store plans to private backend hash table
- - use parameters for stored queries
- - better design for SPI
- - memory usage for saved plans
- - save plans "by key"
- The current query cache code depends on the 7.1 memory management. After
-the official 7.1 release I will prepare a patch with query cache+SPI (if not
-hit me over head, please ..)
- What happens next does not depend on me, *it's up to the code developers*.
- For example, Jan has an interesting idea about caching all plans that a
-backend processes. But that's the far future, and IMHO we must go by small
-steps to Oracle's funeral :-)
- If I need the query cache in my own work (typical for some web+pgsql) or
-there is some public interest, I will continue on this; if not, I will freeze it.
-(There is more interesting work, like http://mape.jcu.cz ... sorry for
-advertising :-)
- Karel
-From pgsql-hackers-owner+M312@postgresql.org Mon Nov 6 03:27:32 2000
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA28404
- for <pgman@candle.pha.pa.us>; Mon, 6 Nov 2000 03:27:32 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA68Pos51966;
- Mon, 6 Nov 2000 03:25:50 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M312@postgresql.org)
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA68Fes50414
- for <pgsql-hackers@postgresql.org>; Mon, 6 Nov 2000 03:15:40 -0500 (EST)
- (envelope-from zakkr@zf.jcu.cz)
-Received: from localhost (zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id JAA20862;
- Mon, 6 Nov 2000 09:15:04 +0100
-Date: Mon, 6 Nov 2000 09:15:04 +0100 (CET)
-From: Karel Zak <zakkr@zf.jcu.cz>
-To: Christof Petig <christof.petig@wtal.de>
-cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
- The Hermit Hacker <scrappy@hub.org>, pgsql-hackers@postgresql.org
-Subject: Re: AW: [HACKERS] Re: [GENERAL] Query caching
-In-Reply-To: <3A02DDFF.E8CBFCF3@wtal.de>
-Message-ID: <Pine.LNX.3.96.1001106090801.20612C-100000@ara.zf.jcu.cz>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Fri, 3 Nov 2000, Christof Petig wrote:
-> Karel Zak wrote:
-> > On Thu, 2 Nov 2000, Zeugswetter Andreas SB wrote:
-> >
-> > >
-> > > > Well I can re-write and resubmit this patch. Add it as a
-> > > > compile time option
-> > > > is not bad idea. Second possibility is distribute it as patch
-> > > > in the contrib
-> > > > tree. And if it until not good tested not dirty with this main tree...
-> > > >
-> > > > Ok, I next week prepare it...
-> > >
-> > > One thing that worries me though is, that it extends the sql language,
-> > > and there has been no discussion about the chosen syntax.
-> > >
-> > > Imho the standard embedded SQL syntax (prepare ...) could be a
-> > > starting point.
-> >
-> > Yes, you are right... my PREPARE/EXECUTE is not too much ready to SQL92,
-> > I some old letter I speculate about "SAVE/EXECUTE PLAN" instead
-> > PREPARE/EXECUTE. But don't forget, it will *experimental* patch... we can
-> > change it in future ..etc.
-> >
-> > Karel
-> [Sorry, I didn't look into your patch, yet.]
- Please, read my old query cache and PREPARE/EXECUTE description...
-> What about parameters? Normally you can prepare a statement and execute it
- We have parameters in PG, see SPI, but for now they are used inside the
-backend only, and there is no statement that exposes this feature in be<->fe.
-> using different parameters. AFAIK postgres' frontend-backend protocol is not
-> designed to take parameters for statements (e.g. like result presents
-> results). A very long road to go.
-> By the way, I'm somewhat interested in getting this feature in. Perhaps it
-> should be part of a protocol redesign (e.g. binary parameters/results).
-> Handling endianness is one aspect, floats are harder (but float->ascii->float
-> sometimes fails as well).
- PREPARE <name> AS <query>
- [ USING type, ... typeN ]
- EXECUTE <name>
- [ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ]
- [ USING val, ... valN ]
- [ <name> [ NOSHARE | SHARE | GLOBAL ]]
-An example:
-PREPARE chris_query AS SELECT * FROM pg_class WHERE relname = $1 USING text;
-EXECUTE chris_query USING 'pg_shadow';
- Or do you mean something else?
- Karel
-From pgsql-hackers-owner+M444@postgresql.org Thu Nov 9 03:32:10 2000
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA09953
- for <pgman@candle.pha.pa.us>; Thu, 9 Nov 2000 03:32:09 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA98RSs11426;
- Thu, 9 Nov 2000 03:27:28 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M444@postgresql.org)
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA98OPs11045;
- Thu, 9 Nov 2000 03:24:25 -0500 (EST)
- (envelope-from zakkr@zf.jcu.cz)
-Received: from localhost (zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id JAA08951;
- Thu, 9 Nov 2000 09:23:41 +0100
-Date: Thu, 9 Nov 2000 09:23:41 +0100 (CET)
-From: Karel Zak <zakkr@zf.jcu.cz>
-To: Christof Petig <christof.petig@wtal.de>
-cc: PostgreSQL Hackers <pgsql-hackers@postgresql.org>,
- Michael Meskes <meskes@postgresql.org>,
- Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
- The Hermit Hacker <scrappy@hub.org>
-Subject: Re: AW: [HACKERS] Re: [GENERAL] Query caching
-In-Reply-To: <3A096BCE.F9887955@wtal.de>
-Message-ID: <Pine.LNX.3.96.1001109090739.8052B-100000@ara.zf.jcu.cz>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Wed, 8 Nov 2000, Christof Petig wrote:
-> Karel Zak wrote:
-> > > What about parameters? Normally you can prepare a statement and execute it
-> >
-> > We have in PG parameters, see SPI, but now it's used inside backend only
-> > and not exist statement that allows to use this feature in be<->fe.
-> Sad. Since ecpg would certainly benefit from this.
-> > > using different parameters. AFAIK postgres' frontend-backend protocol is not
-> > > designed to take parameters for statements (e.g. like result presents
-> > > results). A very long road to go.
-> > > By the way, I'm somewhat interested in getting this feature in. Perhaps it
-> > > should be part of a protocol redesign (e.g. binary parameters/results).
-> > > Handling endianness is one aspect, floats are harder (but float->ascii->float
-> > > sometimes fails as well).
-> >
-> > PREPARE <name> AS <query>
-> > [ USING type, ... typeN ]
-> >
-> > EXECUTE <name>
-> > [ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ]
-> > [ USING val, ... valN ]
-> >
-> > [ <name> [ NOSHARE | SHARE | GLOBAL ]]
-> >
-> > An example:
-> >
-> > PREPARE chris_query AS SELECT * FROM pg_class WHERE relname = $1 USING text;
-> I would prefer '?' as a parameter name, since this is in the embedded sql standard
-> (do you have a copy of the 94 draft? I can mail mine to you?)
- This does not depend on the query cache. '$n' is the PostgreSQL query
-parameter keyword and is defined in the standard parser. The PREPARE statement
-does not parse the query; that is a job for the standard parser.
-> Also the standard says a whole lot about guessing the parameter's type.
-> Also I vote for ?::type or type(?) or sql's cast(...) (don't know it's syntax)
-> instead of abusing the using keyword.
-The PostgreSQL executor expects the parameter types as a separate input (array).
-I'm not sure how expensive/feasible it is to derive them from the query.
-> > EXECUTE chris_query USING 'pg_shadow';
-> Great idea of yours to implement this! Since I was thinking about implementing a
-> more decent schema for ecpg but had no mind to touch the backend and be-fe
-> protocol (yet).
-> It would be desirable to do an 'execute immediate using', since using input
-> parameters would take a lot of code away from ecpg.
-By the way, PREPARE/EXECUTE is only the interface. More interesting at this
-stage is the query-cache kernel. SQL92 really is a little unlike my PREPARE/EXECUTE.
- Karel
-From pgsql-hackers-owner+M9563@postgresql.org Thu May 31 16:31:59 2001
-Return-path: <pgsql-hackers-owner+M9563@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4VKVxc26942
- for <pgman@candle.pha.pa.us>; Thu, 31 May 2001 16:31:59 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f4VKVIE38645;
- Thu, 31 May 2001 16:31:18 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9563@postgresql.org)
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f4VKNVE35356
- for <pgsql-hackers@postgresql.org>; Thu, 31 May 2001 16:23:31 -0400 (EDT)
- (envelope-from zakkr@zf.jcu.cz)
-Received: (from zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) id WAA19957;
- Thu, 31 May 2001 22:23:26 +0200
-Date: Thu, 31 May 2001 22:23:26 +0200
-From: Karel Zak <zakkr@zf.jcu.cz>
-To: Roberto Abalde <roberto.abalde@galego21.org>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Cache for query plans
-Message-ID: <20010531222326.B16862@ara.zf.jcu.cz>
-References: <000701c0e932$d17646c0$c6023dc8@ultra>
-MIME-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-User-Agent: Mutt/1.0.1i
-In-Reply-To: <000701c0e932$d17646c0$c6023dc8@ultra>; from roberto.abalde@galego21.org on Wed, May 30, 2001 at 03:00:53PM -0300
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: ORr
-On Wed, May 30, 2001 at 03:00:53PM -0300, Roberto Abalde wrote:
-> Hi,
-> I need to implement a cache for query plans as part of my BSc thesis. Does
-> anybody know what happened to Karel Zak's patch?
- Hi,
- my patch is on my ftp and nobody is working on it, but I think it's a good
-start for further work. I'm not sure about merging this experimental
-(but usable) patch into the official sources. For example, Jan has a more complex
-idea about a query plan cache ... but first we must solve some
-sub-problems, like memory management in shared memory that is transparent
-to standard routines like copying a query plan ... and Tom isn't sure about a
-query cache in shared memory...etc. Too many queries, but few answers :-)
- Karel
-> PS: Sorry for my english :(
- Have you ever read any of my mail? :-)
- Karel
- Karel Zak <zakkr@zf.jcu.cz>
- http://home.zf.jcu.cz/~zakkr/
- C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M21218@postgresql.org Fri Apr 12 04:52:19 2002
-Return-path: <pgsql-hackers-owner+M21218@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3C8qIS25666
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 04:52:18 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id AE2FA4769F1; Fri, 12 Apr 2002 03:54:34 -0400 (EDT)
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])
- by postgresql.org (Postfix) with ESMTP id A05A94769DC
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 03:51:27 -0400 (EDT)
-Received: from ara.zf.jcu.cz (LOCALHOST [])
- by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) with ESMTP id g3C7pHBK012031;
- Fri, 12 Apr 2002 09:51:17 +0200
-Received: (from zakkr@localhost)
- by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) id g3C7pGum012030;
- Fri, 12 Apr 2002 09:51:16 +0200
-Date: Fri, 12 Apr 2002 09:51:16 +0200
-From: Karel Zak <zakkr@zf.jcu.cz>
-To: pgsql-hackers@postgresql.org
-cc: Hiroshi Inoue <Inoue@tpf.co.jp>
-Subject: Re: [HACKERS] 7.3 schedule
-Message-ID: <20020412095116.B6370@zf.jcu.cz>
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au> <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org>
-MIME-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Content-Disposition: inline
-User-Agent: Mutt/1.2.5i
-In-Reply-To: <20020412004134.5d35a2dd.nconway@klamath.dyndns.org>; from nconway@klamath.dyndns.org on Fri, Apr 12, 2002 at 12:41:34AM -0400
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Fri, Apr 12, 2002 at 12:41:34AM -0400, Neil Conway wrote:
-> On Fri, 12 Apr 2002 12:58:01 +0900
-> "Hiroshi Inoue" <Inoue@tpf.co.jp> wrote:
-> >
-> > Just a confirmation.
-> > Someone is working on PREPARE/EXECUTE ?
-> > What about Karel's work ?
- Right question :-)
-> I am. My work is based on Karel's stuff -- at the moment I'm still
-> basically working on getting Karel's patch to play nicely with
-> current sources; once that's done I'll be addressing whatever
-> issues are stopping the code from getting into CVS.
- My patch (qcache) for PostgreSQL 7.0 is available at
- ftp://ftp2.zf.jcu.cz/users/zakkr/pg/.
- I am very much looking forward to Neil's work on this.
- Notes:
- * It's an experimental patch, but usable. All the features mentioned below
- work.
- * PREPARE/EXECUTE is not only about SQL statements; I think it is a good
- idea to create something common and robust for query-plan caching,
- because there is, for example, SPI too. The RI triggers are based
- on SPI_saveplan().
- * My patch supports an EXECUTE INTO feature:
- PREPARE foo AS SELECT * FROM pg_class WHERE relname ~~ $1 USING text;
- EXECUTE foo USING 'pg%'; <-- standard select
- EXECUTE foo INTO TEMP newtab USING 'pg%'; <-- select into
- * The patch allows storing query plans in shared memory, so it is
- possible to EXECUTE them in other backends (over the same DB), and plans
- are persistent across connections. For this feature I created a special
- memory context subsystem (like the current aset.c, but it works with
- IPC shared memory).
- This is maybe too complex a solution, and (maybe) it is sufficient to
- cache queries in one backend only. I know there is disbelief about this
- shared memory solution (Tom?).
- Karel
- My experimental patch README (excuse my English):
- Implementation
- ~~~~~~~~~~~~~~
- The qCache allows saving a queryTree and a queryPlan. There are
- two spaces available for data caching.
- LOCAL - data is cached in backend non-shared memory and isn't
- available in other backends.
- SHARE - data is cached in backend shared memory and is
- visible in all backends.
- Because the size of the shared memory pool is limited and is set during
- postmaster start-up, the qCache must remove all old plans if the pool is
- full. You can mark each entry as "REMOVEABLE" or "NOTREMOVEABLE".
- A removeable entry is removed if the pool is full.
- A not-removeable entry must be removed via qCache_Remove() or
- the other routines. The qCache does not remove such an entry itself.
- All records in the qCache are cached (in the hash table) under some key.
- The qCache knows two alternative kinds of key --- "KEY_STRING" and "KEY_BINARY".
- The qCache API does not allow direct access to shared memory; all cached
- plans that the API returns are copied to CurrentMemoryContext. All (qCache_ )
- routines lock shmem themselves (the exception is qCache_RemoveOldest_ShareRemoveAble()).
- - a spin lock is used for locking.
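The removeable / not-removeable rule described above can be illustrated with a small sketch. This is not the qCache code: the names (SketchCache, sketch_store, sketch_find), the fixed slot count, and the choice to evict only the single oldest removeable entry are all assumptions made for illustration, and the sketch ignores shared memory and locking entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SLOTS 4          /* hypothetical pool capacity, in entries */

typedef struct {
    bool used;
    bool removable;             /* REMOVEABLE vs. NOTREMOVEABLE */
    unsigned long stamp;        /* insertion order; smaller means older */
    char key[32];               /* string key, in the spirit of KEY_STRING */
} SketchEntry;

typedef struct {
    SketchEntry slot[SKETCH_SLOTS];
    unsigned long clock;
} SketchCache;

/* Find the slot holding key, or -1 if it is not cached. */
static int sketch_find(const SketchCache *c, const char *key)
{
    assert(c != NULL && key != NULL);
    for (int i = 0; i < SKETCH_SLOTS; i++)
        if (c->slot[i].used && strcmp(c->slot[i].key, key) == 0)
            return i;
    return -1;
}

/* Store key. When the pool is full, evict the oldest removable entry.
 * Returns the slot used, or -1 if every entry is marked not removable. */
static int sketch_store(SketchCache *c, const char *key, bool removable)
{
    int target = -1;

    assert(c != NULL && key != NULL);

    /* Prefer a free slot. */
    for (int i = 0; i < SKETCH_SLOTS; i++)
        if (!c->slot[i].used) { target = i; break; }

    /* Pool full: pick the oldest entry that may be removed. */
    if (target < 0)
        for (int i = 0; i < SKETCH_SLOTS; i++)
            if (c->slot[i].removable &&
                (target < 0 || c->slot[i].stamp < c->slot[target].stamp))
                target = i;

    if (target < 0)
        return -1;              /* only pinned entries remain */

    c->slot[target].used = true;
    c->slot[target].removable = removable;
    c->slot[target].stamp = ++c->clock;
    strncpy(c->slot[target].key, key, sizeof c->slot[target].key - 1);
    c->slot[target].key[sizeof c->slot[target].key - 1] = '\0';
    return target;
}
```

A not-removeable entry in this sketch stays until the caller removes it explicitly, which mirrors the role the README gives to qCache_Remove().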
- Memory management
- ~~~~~~~~~~~~~~~~~
- The qCache uses its own memory context for the shared pool, independent of
- the standard aset/mcxt, but with a compatible API --- this allows using the
- standard palloc() (which is very necessary for basic plan-tree operations,
- for example copyObject()). The qCache memory management is very similar to the
- current aset.c code. It uses chunked blocks too, but the block is smaller - 1024b.
- The number of blocks can be set in the postmaster 'argv' via the option
- '-Z'.
- For plan storage, a separate MemoryContext is used for each plan; it
- is a good idea (Hiroshi's ?), because creating a new context is simple and
- inexpensive and allows a cached plan to be destroyed (freed) easily. This
- method is used in my SPI overhaul instead of TopMemoryContext feeding.
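The point about one context per plan (cheap to create, and everything is freed in one call) can be sketched with a tiny arena. This is only an illustration of the idea using malloc(); PlanArena, arena_alloc and arena_destroy are hypothetical names, not the PostgreSQL MemoryContext API and not the patch's shared-memory allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every allocation; the union keeps the
 * payload that follows suitably aligned for any object type. */
typedef union ArenaChunk {
    union ArenaChunk *next;
    max_align_t align;
} ArenaChunk;

typedef struct {
    ArenaChunk *head;   /* every allocation made on behalf of one plan */
    int live;           /* number of allocations currently held */
} PlanArena;

/* Allocate size bytes that belong to this plan's arena. */
static void *arena_alloc(PlanArena *a, size_t size)
{
    ArenaChunk *c;

    assert(a != NULL);
    c = malloc(sizeof(ArenaChunk) + size);
    if (c == NULL)
        return NULL;
    c->next = a->head;
    a->head = c;
    a->live++;
    return c + 1;       /* payload starts right after the header */
}

/* Free everything that was allocated for the plan in one pass. */
static void arena_destroy(PlanArena *a)
{
    assert(a != NULL);
    while (a->head != NULL) {
        ArenaChunk *next = a->head->next;
        free(a->head);
        a->head = next;
    }
    a->live = 0;
}
```

Dropping a cached plan then needs no per-node bookkeeping: the owner calls arena_destroy() once, which is the property the README attributes to a per-plan MemoryContext.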
- Postmaster
- ~~~~~~~~~~
- The query cache memory is initialized during postmaster startup. The size
- of the query cache pool is set via the '-Z <number-of-blocks>' switch --- the
- default is 100 blocks where 1 block = 1024b, which is sufficient for 20-30
- cached plans. One query needs roughly 3-10 blocks; for example a query like
- PREPARE sel AS SELECT * FROM pg_class;
- needs 10Kb, because the table pg_class has very many columns.
- Note: for development I added an SQL function: "SELECT qcache_state();";
- this routine shows the usage of the qCache.
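The sizing figures above can be checked with simple integer arithmetic. The helpers below are hypothetical and ignore any per-chunk or per-context overhead the real allocator would add; they only show that 100 blocks at 3 to 10 blocks per plan gives between 10 and 33 plans, which brackets the README's estimate of 20-30.

```c
#include <assert.h>

#define QC_BLOCK_BYTES 1024   /* block size stated in the README */

/* Number of plans that fit when each plan needs blocks_per_plan blocks. */
static int plans_that_fit(int pool_blocks, int blocks_per_plan)
{
    assert(pool_blocks >= 0 && blocks_per_plan > 0);
    return pool_blocks / blocks_per_plan;
}

/* Blocks needed to hold plan_bytes, rounded up to whole blocks. */
static int blocks_for_bytes(int plan_bytes)
{
    assert(plan_bytes >= 0);
    return (plan_bytes + QC_BLOCK_BYTES - 1) / QC_BLOCK_BYTES;
}
```

Under these assumptions a 10Kb plan such as the pg_class example occupies 10 blocks, so a default pool would hold about ten plans of that size.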
- SPI
- ~~~
- I slightly rewrote the SPI save plan method and removed the TopMemoryContext
- "feeding".
- Standard SPI:
- SPI_saveplan() - saves each plan to a separate standard memory context.
- SPI_freeplan() - frees a plan.
- By key SPI:
- It is an SPI interface for the query cache and allows saving plans to the
- SHARED or LOCAL cache 'by' an arbitrary key (string or binary). Routines:
- SPI_saveplan_bykey() - save plan to query cache
- SPI_freeplan_bykey() - remove plan from query cache
- SPI_fetchplan_bykey() - fetch plan saved in query cache
- SPI_execp_bykey() - execute (via SPI) plan saved in query
- cache
- - now, users can write functions that save plans to shared memory,
- and the plans are visible in all backends and are persistent across
- connections.
- Example:
- ~~~~~~~
- /* ----------
- * Save/exec query from shared cache via string key
- * ----------
- */
- int keySize = 0;
- char *key = "my unique key";
- res = SPI_execp_bykey(values, nulls, tcount, key, flag, keySize);
- {
- /* --- not plan in cache - must create it --- */
- void *plan;
- plan = SPI_prepare(querystr, valnum, valtypes);
- SPI_saveplan_bykey(plan, key, keySize, flag);
- res = SPI_execute(plan, values, Nulls, tcount);
- }
- elog(NOTICE, "Processed: %d", SPI_processed);
- PREPARE/EXECUTE
- ~~~~~~~~~~~~~~~
- * Syntax:
- PREPARE <name> AS <query>
- [ USING type, ... typeN ]
- EXECUTE <name>
- [ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ]
- [ USING val, ... valN ]
- [ <name> [ NOSHARE | SHARE | GLOBAL ]]
- I know that it is a little outside SQL92... (use CREATE/DROP PLAN instead
- of this?) --- what do the SQL standard gurus think?
- * Where:
- NOSHARE --- cached in the local backend query cache - not accessible
- from other backends and not persistent across
- connections.
- SHARE --- cached in the shared query cache and accessible from
- all backends which work over the same database.
- GLOBAL --- cached in the shared query cache and accessible from
- all backends and all databases.
- - default is 'SHARE'
- Deallocate:
- ALL --- deallocate all of the user's plans
- ALL INTERNAL --- deallocate all internal plans, like plans
- cached via SPI. It is necessary if the user
- alters/drops a table ...etc.
- * Parameters:
- The "USING" part of the PREPARE statement sets the data types of the
- parameters in the query. For example:
- PREPARE sel AS SELECT * FROM pg_class WHERE relname ~~ $1 USING text;
- EXECUTE sel USING 'pg%';
- * Limitation:
- - prepare/execute allows using the full statement of SELECT/INSERT/DELETE/
- - it is possible to use union, subselects, limit, offset, select-into
- Performance:
- ~~~~~~~~~~~
- * the SPI
- - For my tests I slightly changed the RI triggers to use the SPI by_key API
- and save plans to the shared qCache instead of the internal RI hash table.
- The RI triggers use very simple (for parsing) queries, so the qCache
- benefit is not visible. The benefit is larger if backends start up very
- often and the RI triggers always check the same tables. In this situation
- the speed goes up --- 10-12%.
- (This snapshot does not include this RI change.)
- But it all depends on how complicated the query in the trigger is for
- the parser.
- - For tests I use a query that does not use any table (the executor is
- idle), but is difficult for the parser. An example:
- SELECT 'a text ' || (10*10+(100^2))::text || ' next text ' || cast
- (date_part('year', timestamp 'now') AS text );
- - (10000 * this query):
- standard select: 54 sec
- via prepare/execute: 4 sec (93% better)
- IMHO it is not bad.
- - For a standard query like:
- SELECT u.usename, r.relname FROM pg_class r, pg_user u WHERE
- r.relowner = u.usesysid;
- it is 10-20% faster with PREPARE/EXECUTE.
- Karel Zak <zakkr@zf.jcu.cz>
- http://home.zf.jcu.cz/~zakkr/
- C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M21228@postgresql.org Fri Apr 12 10:15:34 2002
-Return-path: <pgsql-hackers-owner+M21228@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CEFXS29835\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 10:15:33 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 7BFE1475A55; Fri, 12 Apr 2002 10:15:27 -0400 (EDT)\r
-Received: from sss.pgh.pa.us (unknown [])\r
- by postgresql.org (Postfix) with ESMTP id 5659B474E71\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 10:14:31 -0400 (EDT)\r
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])\r
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CEEQF27238;\r
- Fri, 12 Apr 2002 10:14:26 -0400 (EDT)\r
-To: Karel Zak <zakkr@zf.jcu.cz>\r
-cc: pgsql-hackers@postgresql.org, Neil Conway <nconway@klamath.dyndns.org>\r
-Subject: Re: [HACKERS] 7.3 schedule \r
-In-Reply-To: <20020412095116.B6370@zf.jcu.cz> \r
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au> <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz>\r
-Comments: In-reply-to Karel Zak <zakkr@zf.jcu.cz>\r
- message dated "Fri, 12 Apr 2002 09:51:16 +0200"\r
-Date: Fri, 12 Apr 2002 10:14:26 -0400\r
-Message-ID: <27235.1018620866@sss.pgh.pa.us>\r
-From: Tom Lane <tgl@sss.pgh.pa.us>\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: ORr\r
-Karel Zak <zakkr@zf.jcu.cz> writes:
-> * The patch allows storing query plans in shared memory, so it is
-> possible to EXECUTE them in other backends (over the same DB), and plans
-> are persistent across connections. For this feature I created a special
-> memory context subsystem (like the current aset.c, but it works with
-> IPC shared memory).
-> This is maybe too complex a solution, and (maybe) it is sufficient to
-> cache queries in one backend only. I know there is disbelief about this
-> shared memory solution (Tom?).
-Yes, that is the part that was my sticking point last time around.
-(1) Because shared memory cannot be extended on-the-fly, I think it is
-a very bad idea to put data structures in there without some well
-thought out way of predicting/limiting their size. (2) How the heck do
-you get rid of obsoleted cached plans, if the things stick around in
-shared memory even after you start a new backend? (3) A shared cache
-requires locking; contention among multiple backends to access that
-shared resource could negate whatever performance benefit you might hope
-to realize from it.
-A per-backend cache kept in local memory avoids all of these problems,
-and I have seen no numbers to make me think that a shared plan cache
-would achieve significantly more performance benefit than a local one.
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M21233@postgresql.org Fri Apr 12 12:26:32 2002
-Return-path: <pgsql-hackers-owner+M21233@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGQVS11018\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 12:26:31 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 38DBB475B20; Fri, 12 Apr 2002 12:22:08 -0400 (EDT)\r
-Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [])\r
- by postgresql.org (Postfix) with ESMTP id 0DA70475B9E\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 12:21:15 -0400 (EDT)\r
-Received: (from pgman@localhost)\r
- by candle.pha.pa.us (8.11.6/8.10.1) id g3CGL4310492;\r
- Fri, 12 Apr 2002 12:21:04 -0400 (EDT)\r
-From: Bruce Momjian <pgman@candle.pha.pa.us>\r
-Message-ID: <200204121621.g3CGL4310492@candle.pha.pa.us>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-In-Reply-To: <27235.1018620866@sss.pgh.pa.us>\r
-To: Tom Lane <tgl@sss.pgh.pa.us>\r
-Date: Fri, 12 Apr 2002 12:21:04 -0400 (EDT)\r
-cc: Karel Zak <zakkr@zf.jcu.cz>, pgsql-hackers@postgresql.org,\r
- Neil Conway <nconway@klamath.dyndns.org>\r
-X-Mailer: ELM [version 2.4ME+ PL97 (25)]\r
-MIME-Version: 1.0\r
-Content-Transfer-Encoding: 7bit\r
-Content-Type: text/plain; charset=US-ASCII\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-Tom Lane wrote:
-> Karel Zak <zakkr@zf.jcu.cz> writes:
-> > * The patch allows storing query plans in shared memory, so it is
-> > possible to EXECUTE them in other backends (over the same DB), and plans
-> > are persistent across connections. For this feature I created a special
-> > memory context subsystem (like the current aset.c, but it works with
-> > IPC shared memory).
-> > This is maybe too complex a solution, and (maybe) it is sufficient to
-> > cache queries in one backend only. I know there is disbelief about this
-> > shared memory solution (Tom?).
-> Yes, that is the part that was my sticking point last time around.
-> (1) Because shared memory cannot be extended on-the-fly, I think it is
-> a very bad idea to put data structures in there without some well
-> thought out way of predicting/limiting their size. (2) How the heck do
-> you get rid of obsoleted cached plans, if the things stick around in
-> shared memory even after you start a new backend? (3) A shared cache
-> requires locking; contention among multiple backends to access that
-> shared resource could negate whatever performance benefit you might hope
-> to realize from it.
-> A per-backend cache kept in local memory avoids all of these problems,
-> and I have seen no numbers to make me think that a shared plan cache
-> would achieve significantly more performance benefit than a local one.
-Certainly a shared cache would be good for apps that connect to issue a
-single query frequently. In such cases, there would be no local cache
-to use.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M21234@postgresql.org Fri Apr 12 12:44:12 2002
-Return-path: <pgsql-hackers-owner+M21234@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGiBS12385\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 12:44:12 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id AEAA7475C6C; Fri, 12 Apr 2002 12:43:17 -0400 (EDT)\r
-Received: from barry.xythos.com (h-64-105-36-191.SNVACAID.covad.net [])\r
- by postgresql.org (Postfix) with ESMTP id CE58C47598E\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 12:42:48 -0400 (EDT)\r
-Received: from xythos.com (localhost.localdomain [])\r
- by barry.xythos.com (8.11.6/8.11.6) with ESMTP id g3CGgaI02920;\r
- Fri, 12 Apr 2002 09:42:36 -0700\r
-Message-ID: <3CB70E7C.3090801@xythos.com>\r
-Date: Fri, 12 Apr 2002 09:42:36 -0700\r
-From: Barry Lind <barry@xythos.com>\r
-User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020310\r
-X-Accept-Language: en-us, en\r
-MIME-Version: 1.0\r
-To: Tom Lane <tgl@sss.pgh.pa.us>\r
-cc: Karel Zak <zakkr@zf.jcu.cz>, pgsql-hackers@postgresql.org,\r
- Neil Conway <nconway@klamath.dyndns.org>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au> <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us>\r
-Content-Type: text/plain; charset=us-ascii; format=flowed\r
-Content-Transfer-Encoding: 7bit\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: ORr\r
-Tom Lane wrote:
-> Yes, that is the part that was my sticking point last time around.
-> (1) Because shared memory cannot be extended on-the-fly, I think it is
-> a very bad idea to put data structures in there without some well
-> thought out way of predicting/limiting their size. (2) How the heck do
-> you get rid of obsoleted cached plans, if the things stick around in
-> shared memory even after you start a new backend? (3) A shared cache
-> requires locking; contention among multiple backends to access that
-> shared resource could negate whatever performance benefit you might hope
-> to realize from it.
-> A per-backend cache kept in local memory avoids all of these problems,
-> and I have seen no numbers to make me think that a shared plan cache
-> would achieve significantly more performance benefit than a local one.
-Oracle's implementation is a shared cache for all plans. This was
-introduced in Oracle 6 or 7 (I don't remember which anymore). The net
-effect was that in general there was a significant performance
-improvement with the shared cache. However poorly written apps can now
-bring the Oracle database to its knees because of the locking issues
-associated with the shared cache. For example if the most frequently
-run sql statements are coded poorly (i.e. they don't use bind variables,
-e.g. 'select bar from foo where foobar = $1' vs. 'select bar from foo
-where foobar = ' || somevalue (where somevalue is likely to be
-different on every call)) the shared cache doesn't help and its overhead
-becomes significant.
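The effect described above, where literal values in the statement text defeat a cache keyed on that text, can be shown with a toy lookup table. This is a sketch only: TextCache, text_cache_lookup and run_queries are hypothetical names, and it assumes the cache key is the exact statement text, which the message implies but does not spell out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KEYS 16
#define KEY_LEN  96

typedef struct {
    char text[MAX_KEYS][KEY_LEN];
    int  count;     /* distinct statement texts seen so far */
    int  hits;      /* lookups that matched an existing text */
} TextCache;

/* Look up sql by exact text and remember it on a miss. Returns 1 on a hit. */
static int text_cache_lookup(TextCache *c, const char *sql)
{
    assert(c != NULL && sql != NULL);
    for (int i = 0; i < c->count; i++)
        if (strcmp(c->text[i], sql) == 0) {
            c->hits++;
            return 1;
        }
    if (c->count < MAX_KEYS) {
        snprintf(c->text[c->count], KEY_LEN, "%s", sql);
        c->count++;
    }
    return 0;
}

/* Issue the same query for ids 1..n, with or without a bind variable. */
static void run_queries(TextCache *c, int n, int use_bind)
{
    char sql[KEY_LEN];

    for (int id = 1; id <= n; id++) {
        if (use_bind)
            snprintf(sql, sizeof sql,
                     "select bar from foo where foobar = $1");
        else
            snprintf(sql, sizeof sql,
                     "select bar from foo where foobar = %d", id);
        text_cache_lookup(c, sql);
    }
}
```

With the bind variable, every call after the first finds the same entry. With the value pasted into the text, each call adds a new entry and never hits, so the cache costs lookups and space without saving any planning.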
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M21237@postgresql.org Fri Apr 12 12:50:28 2002
-Return-path: <pgsql-hackers-owner+M21237@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGoRS13005\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 12:50:28 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 32A28475BA1; Fri, 12 Apr 2002 12:50:15 -0400 (EDT)\r
-Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [])\r
- by postgresql.org (Postfix) with ESMTP id 07F1E475892\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 12:49:43 -0400 (EDT)\r
-Received: (from pgman@localhost)\r
- by candle.pha.pa.us (8.11.6/8.10.1) id g3CGnbw12950;\r
- Fri, 12 Apr 2002 12:49:37 -0400 (EDT)\r
-From: Bruce Momjian <pgman@candle.pha.pa.us>\r
-Message-ID: <200204121649.g3CGnbw12950@candle.pha.pa.us>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-In-Reply-To: <3CB70E7C.3090801@xythos.com>\r
-To: Barry Lind <barry@xythos.com>\r
-Date: Fri, 12 Apr 2002 12:49:37 -0400 (EDT)\r
-cc: Tom Lane <tgl@sss.pgh.pa.us>, Karel Zak <zakkr@zf.jcu.cz>,\r
- pgsql-hackers@postgresql.org, Neil Conway <nconway@klamath.dyndns.org>\r
-X-Mailer: ELM [version 2.4ME+ PL97 (25)]\r
-MIME-Version: 1.0\r
-Content-Transfer-Encoding: 7bit\r
-Content-Type: text/plain; charset=US-ASCII\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-Barry Lind wrote:
-> Oracle's implementation is a shared cache for all plans. This was
-> introduced in Oracle 6 or 7 (I don't remember which anymore). The net
-> effect was that in general there was a significant performance
-> improvement with the shared cache. However poorly written apps can now
-> bring the Oracle database to its knees because of the locking issues
-> associated with the shared cache. For example if the most frequently
-> run sql statements are coded poorly (i.e. they don't use bind variables,
-> e.g. 'select bar from foo where foobar = $1' vs. 'select bar from foo
-> where foobar = ' || somevalue (where somevalue is likely to be
-> different on every call)) the shared cache doesn't help and its overhead
-> becomes significant.
-This is very interesting. We have always been concerned that shared
-cache invalidation could cause more of a performance problem than the
-shared cache gives benefit, and it sounds like you are saying exactly that.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M21238@postgresql.org Fri Apr 12 12:51:55 2002
-Return-path: <pgsql-hackers-owner+M21238@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGptS13119\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 12:51:55 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id C599D475BC6; Fri, 12 Apr 2002 12:51:47 -0400 (EDT)\r
-Received: from sss.pgh.pa.us (unknown [])\r
- by postgresql.org (Postfix) with ESMTP id C9F94475892\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 12:51:26 -0400 (EDT)\r
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])\r
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CGpQF27967;\r
- Fri, 12 Apr 2002 12:51:27 -0400 (EDT)\r
-To: Bruce Momjian <pgman@candle.pha.pa.us>\r
-cc: Karel Zak <zakkr@zf.jcu.cz>, pgsql-hackers@postgresql.org,\r
- Neil Conway <nconway@klamath.dyndns.org>\r
-Subject: Re: [HACKERS] 7.3 schedule \r
-In-Reply-To: <200204121621.g3CGL4310492@candle.pha.pa.us> \r
-References: <200204121621.g3CGL4310492@candle.pha.pa.us>\r
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>\r
- message dated "Fri, 12 Apr 2002 12:21:04 -0400"\r
-Date: Fri, 12 Apr 2002 12:51:26 -0400\r
-Message-ID: <27964.1018630286@sss.pgh.pa.us>\r
-From: Tom Lane <tgl@sss.pgh.pa.us>\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> Certainly a shared cache would be good for apps that connect to issue a
-> single query frequently. In such cases, there would be no local cache
-> to use.
-We have enough other problems with the single-query-per-connection
-scenario that I see no reason to believe that a shared plan cache will
-help materially. The correct answer for those folks will *always* be
-to find a way to reuse the connection.
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M21241@postgresql.org Fri Apr 12 16:25:46 2002
-Return-path: <pgsql-hackers-owner+M21241@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CKPkS03078\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 16:25:46 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 9C3BD475CC6; Fri, 12 Apr 2002 16:25:42 -0400 (EDT)\r
-Received: from klamath.dyndns.org (CPE002078144ae0.cpe.net.cable.rogers.com [])\r
- by postgresql.org (Postfix) with ESMTP id B06D8475909\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 16:24:52 -0400 (EDT)\r
-Received: from jiro (jiro [])\r
- by klamath.dyndns.org (Postfix) with SMTP\r
- id C05557013; Fri, 12 Apr 2002 16:24:53 -0400 (EDT)\r
-Date: Fri, 12 Apr 2002 16:24:48 -0400\r
-From: Neil Conway <nconway@klamath.dyndns.org>\r
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>\r
-cc: tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz, pgsql-hackers@postgresql.org\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-Message-ID: <20020412162448.4d46d747.nconway@klamath.dyndns.org>\r
-In-Reply-To: <200204121621.g3CGL4310492@candle.pha.pa.us>\r
-References: <27235.1018620866@sss.pgh.pa.us>\r
- <200204121621.g3CGL4310492@candle.pha.pa.us>\r
-X-Mailer: Sylpheed version 0.7.4 (GTK+ 1.2.10; i386-debian-linux-gnu)\r
-MIME-Version: 1.0\r
-Content-Type: text/plain; charset=US-ASCII\r
-Content-Transfer-Encoding: 7bit\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: ORr\r
-On Fri, 12 Apr 2002 12:21:04 -0400 (EDT)
-"Bruce Momjian" <pgman@candle.pha.pa.us> wrote:
-> Tom Lane wrote:
-> > A per-backend cache kept in local memory avoids all of these problems,
-> > and I have seen no numbers to make me think that a shared plan cache
-> > would achieve significantly more performance benefit than a local one.
-> Certainly a shared cache would be good for apps that connect to issue a
-> single query frequently. In such cases, there would be no local cache
-> to use.
-One problem with this kind of scenario is: what to do if the plan no
-longer exists for some reason? (e.g. the code that was supposed to be
-PREPARE-ing your statements failed to execute properly, or the cached
-plan has been evicted from shared memory, or the database was restarted,
-etc.) -- EXECUTE in and of itself won't have enough information to do
-anything useful. We could perhaps provide a means for an application
-to test for the existence of a cached plan (in which case the
-application developer will need to add logic to their application
-to re-prepare the query if necessary, which could get complicated).
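The "test for the plan, re-prepare if necessary" logic mentioned above might look roughly like the following on the application side. Everything here is a hypothetical stand-in (PlanStore, plan_exists, plan_prepare, plan_evict, execute_with_reprepare); no such client API exists in the discussion, and the single-slot store only simulates a plan that can vanish.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a server-side plan store with a single slot. */
typedef struct {
    bool present;
    char name[32];
    int  prepares;          /* how many times the plan was (re)built */
} PlanStore;

/* Does a plan with this name currently exist? */
static bool plan_exists(const PlanStore *s, const char *name)
{
    assert(s != NULL && name != NULL);
    return s->present && strcmp(s->name, name) == 0;
}

/* Build (or rebuild) the named plan. */
static void plan_prepare(PlanStore *s, const char *name)
{
    assert(s != NULL && name != NULL);
    strncpy(s->name, name, sizeof s->name - 1);
    s->name[sizeof s->name - 1] = '\0';
    s->present = true;
    s->prepares++;
}

/* Simulates eviction or a restart: the plan silently disappears. */
static void plan_evict(PlanStore *s)
{
    assert(s != NULL);
    s->present = false;
}

/* Application-side wrapper: test for the plan, re-prepare it if it is
 * gone, then execute. Returns true if a re-prepare was needed. */
static bool execute_with_reprepare(PlanStore *s, const char *name)
{
    bool reprepared = false;

    if (!plan_exists(s, name)) {
        plan_prepare(s, name);
        reprepared = true;
    }
    /* ... the real EXECUTE would happen here ... */
    return reprepared;
}
```

Even this small wrapper leaves a gap between the existence test and the execution during which a shared plan could be evicted, which is part of why the message expects the application logic to get complicated.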
-Neil Conway <neilconway@rogers.com>
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M21242@postgresql.org Fri Apr 12 17:27:24 2002
-Return-path: <pgsql-hackers-owner+M21242@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLRNS14410\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 17:27:23 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id E05A1475D30; Fri, 12 Apr 2002 17:26:40 -0400 (EDT)\r
-Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [])\r
- by postgresql.org (Postfix) with ESMTP id 36BBB475858\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 17:25:44 -0400 (EDT)\r
-Received: (from pgman@localhost)\r
- by candle.pha.pa.us (8.11.6/8.10.1) id g3CLPVa14231;\r
- Fri, 12 Apr 2002 17:25:31 -0400 (EDT)\r
-From: Bruce Momjian <pgman@candle.pha.pa.us>\r
-Message-ID: <200204122125.g3CLPVa14231@candle.pha.pa.us>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-In-Reply-To: <20020412162448.4d46d747.nconway@klamath.dyndns.org>\r
-To: Neil Conway <nconway@klamath.dyndns.org>\r
-Date: Fri, 12 Apr 2002 17:25:31 -0400 (EDT)\r
-cc: tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz, pgsql-hackers@postgresql.org\r
-X-Mailer: ELM [version 2.4ME+ PL97 (25)]\r
-MIME-Version: 1.0\r
-Content-Transfer-Encoding: 7bit\r
-Content-Type: text/plain; charset=US-ASCII\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-Neil Conway wrote:
-> On Fri, 12 Apr 2002 12:21:04 -0400 (EDT)
-> "Bruce Momjian" <pgman@candle.pha.pa.us> wrote:
-> > Tom Lane wrote:
-> > > A per-backend cache kept in local memory avoids all of these problems,
-> > > and I have seen no numbers to make me think that a shared plan cache
-> > > would achieve significantly more performance benefit than a local one.
-> >
-> > Certainly a shared cache would be good for apps that connect to issue a
-> > single query frequently. In such cases, there would be no local cache
-> > to use.
-> One problem with this kind of scenario is: what to do if the plan no
-> longer exists for some reason? (e.g. the code that was supposed to be
-> PREPARE-ing your statements failed to execute properly, or the cached
-> plan has been evicted from shared memory, or the database was restarted,
-> etc.) -- EXECUTE in and of itself won't have enough information to do
-> anything useful. We could perhaps provide a means for an application
-> to test for the existence of a cached plan (in which case the
-> application developer will need to add logic to their application
-> to re-prepare the query if necessary, which could get complicated).
-Oh, are you thinking that one backend would do the PREPARE and another
-one the EXECUTE? I can't see that working at all. I thought there
-would be some way to quickly test if the submitted query was in the cache,
-but maybe that is too much of a performance penalty to be worth it.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From tgl@sss.pgh.pa.us Fri Apr 12 17:36:17 2002
-Return-path: <tgl@sss.pgh.pa.us>\r
-Received: from sss.pgh.pa.us (root@[])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLaGS16061\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 17:36:17 -0400 (EDT)\r
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])\r
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CLaGF10813;\r
- Fri, 12 Apr 2002 17:36:16 -0400 (EDT)\r
-To: Bruce Momjian <pgman@candle.pha.pa.us>\r
-cc: Neil Conway <nconway@klamath.dyndns.org>, zakkr@zf.jcu.cz,\r
- pgsql-hackers@postgresql.org\r
-Subject: Re: [HACKERS] 7.3 schedule \r
-In-Reply-To: <200204122125.g3CLPVa14231@candle.pha.pa.us> \r
-References: <200204122125.g3CLPVa14231@candle.pha.pa.us>\r
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>\r
- message dated "Fri, 12 Apr 2002 17:25:31 -0400"\r
-Date: Fri, 12 Apr 2002 17:36:16 -0400\r
-Message-ID: <10810.1018647376@sss.pgh.pa.us>\r
-From: Tom Lane <tgl@sss.pgh.pa.us>\r
-Status: ORr\r
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> Oh, are you thinking that one backend would do the PREPARE and another
-> one the EXECUTE? I can't see that working at all.
-Uh, why exactly were you advocating a shared cache then? Wouldn't that
-be exactly the *point* of a shared cache?
- regards, tom lane
-From pgsql-hackers-owner+M21245@postgresql.org Fri Apr 12 17:39:13 2002
-Return-path: <pgsql-hackers-owner+M21245@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLdCS16515\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 17:39:12 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id A904B475E15; Fri, 12 Apr 2002 17:39:09 -0400 (EDT)\r
-Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [])\r
- by postgresql.org (Postfix) with ESMTP id B1A3F4758DE\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 17:38:25 -0400 (EDT)\r
-Received: (from pgman@localhost)\r
- by candle.pha.pa.us (8.11.6/8.10.1) id g3CLcFX16347;\r
- Fri, 12 Apr 2002 17:38:15 -0400 (EDT)\r
-From: Bruce Momjian <pgman@candle.pha.pa.us>\r
-Message-ID: <200204122138.g3CLcFX16347@candle.pha.pa.us>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-In-Reply-To: <10810.1018647376@sss.pgh.pa.us>\r
-To: Tom Lane <tgl@sss.pgh.pa.us>\r
-Date: Fri, 12 Apr 2002 17:38:15 -0400 (EDT)\r
-cc: Neil Conway <nconway@klamath.dyndns.org>, zakkr@zf.jcu.cz,\r
- pgsql-hackers@postgresql.org\r
-X-Mailer: ELM [version 2.4ME+ PL97 (25)]\r
-MIME-Version: 1.0\r
-Content-Transfer-Encoding: 7bit\r
-Content-Type: text/plain; charset=US-ASCII\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-Tom Lane wrote:
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > Oh, are you thinking that one backend would do the PREPARE and another
-> > one the EXECUTE? I can't see that working at all.
-> Uh, why exactly were you advocating a shared cache then? Wouldn't that
-> be exactly the *point* of a shared cache?
-I thought it would somehow compare the SQL query string to the cached
-plans and if it matched, it would use that plan rather than make a new
-one. Any DDL statement would flush the cache.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M21246@postgresql.org Fri Apr 12 17:56:58 2002
-Return-path: <pgsql-hackers-owner+M21246@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLuvS19021\r
- for <pgman@candle.pha.pa.us>; Fri, 12 Apr 2002 17:56:58 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 1B4D6475E2C; Fri, 12 Apr 2002 17:56:55 -0400 (EDT)\r
-Received: from voyager.corporate.connx.com (unknown [])\r
- by postgresql.org (Postfix) with ESMTP id 059F1475858\r
- for <pgsql-hackers@postgresql.org>; Fri, 12 Apr 2002 17:56:13 -0400 (EDT)\r
-X-MimeOLE: Produced By Microsoft Exchange V6.0.4712.0\r
-content-class: urn:content-classes:message\r
-MIME-Version: 1.0\r
-Content-Type: text/plain;\r
- charset="iso-8859-1"\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-Date: Fri, 12 Apr 2002 14:59:15 -0700\r
-Message-ID: <D90A5A6C612A39408103E6ECDD77B82906F42C@voyager.corporate.connx.com>\r
-Thread-Topic: [HACKERS] 7.3 schedule\r
-Thread-Index: AcHia2aODSpgXEd4Tluz/N0jN5fJOQAAC//w\r
-From: "Dann Corbit" <DCorbit@connx.com>\r
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>, "Tom Lane" <tgl@sss.pgh.pa.us>\r
-cc: "Neil Conway" <nconway@klamath.dyndns.org>, <zakkr@zf.jcu.cz>,\r
- <pgsql-hackers@postgresql.org>\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Content-Transfer-Encoding: 8bit\r
-X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id g3CLuvS19021\r
-Status: OR\r
------Original Message-----
-From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
-Sent: Friday, April 12, 2002 2:38 PM
-To: Tom Lane
-Cc: Neil Conway; zakkr@zf.jcu.cz; pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] 7.3 schedule
-Tom Lane wrote:
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > Oh, are you thinking that one backend would do the PREPARE and another
-> > one the EXECUTE? I can't see that working at all.
-> Uh, why exactly were you advocating a shared cache then? Wouldn't that
-> be exactly the *point* of a shared cache?
-I thought it would somehow compare the SQL query string to the cached
-plans and if it matched, it would use that plan rather than make a new
-one. Any DDL statement would flush the cache.
-Many applications will have similar queries coming from lots of
-different end-users. Imagine an order-entry program where people are
-ordering parts. Many of the queries might look like this:
-SELECT part_number FROM parts WHERE part_id = 12324 AND part_cost
-< 12.95
-In order to cache this query, we first parse it to replace the data
-fields with parameter markers.
-Then it looks like this:
-SELECT part_number FROM parts WHERE part_id = ? AND part_cost < ?
-{in the case of a 'LIKE' query or some other query where you can use
-key information, you might have a symbolic replacement like this:
-WHERE field LIKE '{D}%' to indicate that the key can be used}
-Then, we make sure that the case is consistent by either capitalizing
-the whole query or changing it all into lower case:
-select part_number from parts where part_id = ? and part_cost < ?
-Then, we run a checksum on the parameterized string.
-The checksum might be used as a hash table key, where we keep some
-additional information like how stale the entry is, and a pointer to
-the actual parameterized SQL (in case the hash key has a collision,
-it would be simply wrong to run an incorrect query, for obvious enough
-reasons).
-Now, if there are a huge number of users of the same application, it
-makes sense that the probabilities of reusing queries goes up with
-the number of users of the same application. Therefore, I would
-advocate that the cache be kept in shared memory.
-Consider a single application with 100 different queries. Now, add
-one user, ten users, 100 users, ... 10,000 users and you can see
-that the benefit would be greater and greater as we add users.
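The normalization steps described above can be sketched in Python as follows. This is only an illustration of the idea, not PostgreSQL code: the names `parameterize` and `PlanCache`, the regular expression used to spot literals, and the choice of CRC-32 as the checksum are all assumptions made for the example. A regex is no substitute for a real SQL parser (it would, for instance, also fold the case of double-quoted identifiers).

```python
import re
import zlib

# Assumption: a literal is a single-quoted string ('' escapes a quote)
# or a bare integer/decimal number. A real implementation would parse the SQL.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def parameterize(sql):
    """Return (normalized_sql, params): literals become '?', case is folded."""
    params = []

    def _swap(match):
        params.append(match.group(0))
        return "?"

    text = _LITERAL.sub(_swap, sql)
    # Collapse whitespace and fold case so equivalent queries compare equal.
    text = " ".join(text.split()).lower()
    return text, params


class PlanCache:
    """Hash table keyed by a checksum of the parameterized query text."""

    def __init__(self):
        self._slots = {}  # checksum -> (normalized_sql, plan)

    def store(self, sql, plan):
        text, _ = parameterize(sql)
        self._slots[zlib.crc32(text.encode("utf-8"))] = (text, plan)

    def lookup(self, sql):
        text, params = parameterize(sql)
        entry = self._slots.get(zlib.crc32(text.encode("utf-8")))
        # Compare the stored text too: on a checksum collision we must not
        # hand back the plan of a different query.
        if entry is not None and entry[0] == text:
            return entry[1], params
        return None, params
```

With this sketch, two order-entry queries that differ only in their constants map to the same cache slot, and the extracted constants come back as the parameter list for the cached plan.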
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M21270@postgresql.org Sat Apr 13 02:30:47 2002
-Return-path: <pgsql-hackers-owner+M21270@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3D6UkS07169\r
- for <pgman@candle.pha.pa.us>; Sat, 13 Apr 2002 02:30:46 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 23FEC475D1E; Sat, 13 Apr 2002 02:30:38 -0400 (EDT)\r
-Received: from mail.iinet.net.au (symphony-01.iinet.net.au [])\r
- by postgresql.org (Postfix) with SMTP id A08A4475C6C\r
- for <pgsql-hackers@postgresql.org>; Sat, 13 Apr 2002 02:29:37 -0400 (EDT)\r
-Received: (qmail 11594 invoked by uid 666); 13 Apr 2002 06:29:36 -0000\r
-Received: from unknown (HELO SOL) (\r
- by mail.iinet.net.au with SMTP; 13 Apr 2002 06:29:36 -0000\r
-Message-ID: <002301c1e2b3$804bd000$0200a8c0@SOL>\r
-From: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>\r
-To: "Barry Lind" <barry@xythos.com>, "Tom Lane" <tgl@sss.pgh.pa.us>\r
-cc: "Karel Zak" <zakkr@zf.jcu.cz>, <pgsql-hackers@postgresql.org>,\r
- "Neil Conway" <nconway@klamath.dyndns.org>\r
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au> <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us> <3CB70E7C.3090801@xythos.com>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-Date: Sat, 13 Apr 2002 14:21:50 +0800\r
-MIME-Version: 1.0\r
-Content-Type: text/plain;\r
- charset="iso-8859-1"\r
-Content-Transfer-Encoding: 7bit\r
-X-Priority: 3\r
-X-MSMail-Priority: Normal\r
-X-Mailer: Microsoft Outlook Express 5.50.4522.1200\r
-X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-> > thought out way of predicting/limiting their size. (2) How the heck do
-> > you get rid of obsoleted cached plans, if the things stick around in
-> > shared memory even after you start a new backend? (3) A shared cache
-> > requires locking; contention among multiple backends to access that
-> > shared resource could negate whatever performance benefit you might hope
-> > to realize from it.
-I don't understand all these locking problems? Surely the only lock a
-transaction would need on a stored query is one that prevents the cache
-invalidation mechanism from deleting it out from under it? Surely this
-means that there would be tonnes of readers on the cache - none of them
-blocking each other, and the odd invalidation event that needs a complete
-Also, as for invalidation, there probably could be just two reasons to
-invalidate a query in the cache. (1) The cache is running out of space and
-you use LRU or something to remove old queries, or (2) someone runs ANALYZE,
-in which case all cached queries should just be flushed? If they specify an
-actual table to analyze, then just drop all queries on the table.
-Could this cache mechanism be used to make views fast as well? You could
-cache the queries that back views on first use, and then they can follow the
-above rules for flushing...
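The two invalidation rules suggested above (evict the least recently used entry when space runs out, and flush on ANALYZE) could look roughly like this in Python. It is a single-process sketch under stated assumptions: the class name `LruPlanCache`, the `analyze` method, and the idea of recording the set of tables each cached query touches are hypothetical, and no locking is shown.

```python
from collections import OrderedDict


class LruPlanCache:
    """Bounded plan cache: LRU eviction plus flush-on-ANALYZE."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # sql -> (plan, frozenset of table names)

    def get(self, sql):
        entry = self._entries.get(sql)
        if entry is None:
            return None
        self._entries.move_to_end(sql)  # mark as most recently used
        return entry[0]

    def put(self, sql, plan, tables):
        self._entries[sql] = (plan, frozenset(tables))
        self._entries.move_to_end(sql)
        # Rule (1): out of space, so drop the least recently used entries.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def analyze(self, table=None):
        # Rule (2): ANALYZE with no table flushes everything;
        # ANALYZE on one table drops only the queries that touch it.
        if table is None:
            self._entries.clear()
            return
        stale = [sql for sql, (_, tables) in self._entries.items()
                 if table in tables]
        for sql in stale:
            del self._entries[sql]
```

In a shared-memory version every one of these methods would modify shared state (even `get` reorders the LRU list), which is where the contention concern raised elsewhere in the thread comes from.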
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M21276@postgresql.org Sat Apr 13 11:48:51 2002
-Return-path: <pgsql-hackers-owner+M21276@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3DFmoS27879\r
- for <pgman@candle.pha.pa.us>; Sat, 13 Apr 2002 11:48:51 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 9EB81475C5C; Sat, 13 Apr 2002 11:46:52 -0400 (EDT)\r
-Received: from sss.pgh.pa.us (unknown [])\r
- by postgresql.org (Postfix) with ESMTP id 0FE0B474E78\r
- for <pgsql-hackers@postgresql.org>; Sat, 13 Apr 2002 11:46:09 -0400 (EDT)\r
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])\r
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3DFk2F15743;\r
- Sat, 13 Apr 2002 11:46:02 -0400 (EDT)\r
-To: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>\r
-cc: "Barry Lind" <barry@xythos.com>, "Karel Zak" <zakkr@zf.jcu.cz>,\r
- pgsql-hackers@postgresql.org, "Neil Conway" <nconway@klamath.dyndns.org>\r
-Subject: Re: [HACKERS] 7.3 schedule \r
-In-Reply-To: <002301c1e2b3$804bd000$0200a8c0@SOL> \r
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au> <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us> <3CB70E7C.3090801@xythos.com> <002301c1e2b3$804bd000$0200a8c0@SOL>\r
-Comments: In-reply-to "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>\r
- message dated "Sat, 13 Apr 2002 14:21:50 +0800"\r
-Date: Sat, 13 Apr 2002 11:46:01 -0400\r
-Message-ID: <15740.1018712761@sss.pgh.pa.us>\r
-From: Tom Lane <tgl@sss.pgh.pa.us>\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> writes:
-> thought out way of predicting/limiting their size. (2) How the heck do
-> you get rid of obsoleted cached plans, if the things stick around in
-> shared memory even after you start a new backend? (3) A shared cache
-> requires locking; contention among multiple backends to access that
-> shared resource could negate whatever performance benefit you might hope
-> to realize from it.
-> I don't understand all these locking problems?
-Searching the cache and inserting/deleting entries in the cache probably
-have to be mutually exclusive; concurrent insertions probably won't work
-either (at least not without a remarkably intelligent data structure).
-Unless the cache hit rate is remarkably high, there are going to be lots
-of insertions --- and, at steady state, an equal rate of deletions ---
-leading to lots of contention.
-This could possibly be avoided if the cache is not used for all query
-plans but only for explicitly PREPAREd plans, so that only explicit
-EXECUTEs would need to search it. But that approach also makes a
-sizable dent in the usefulness of the cache to begin with.
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M21280@postgresql.org Sat Apr 13 14:36:34 2002
-Return-path: <pgsql-hackers-owner+M21280@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3DIaYS10293\r
- for <pgman@candle.pha.pa.us>; Sat, 13 Apr 2002 14:36:34 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id AA151475BB1; Sat, 13 Apr 2002 14:36:17 -0400 (EDT)\r
-Received: from klamath.dyndns.org (CPE002078144ae0.cpe.net.cable.rogers.com [])\r
- by postgresql.org (Postfix) with ESMTP id 42993475BCB\r
- for <pgsql-hackers@postgresql.org>; Sat, 13 Apr 2002 14:35:42 -0400 (EDT)\r
-Received: from jiro (jiro [])\r
- by klamath.dyndns.org (Postfix) with SMTP\r
- id 82B84700C; Sat, 13 Apr 2002 14:35:42 -0400 (EDT)\r
-Date: Sat, 13 Apr 2002 14:35:39 -0400\r
-From: Neil Conway <nconway@klamath.dyndns.org>\r
-To: "Christopher Kings-Lynne" <chriskl@familyhealth.com.au>\r
-cc: barry@xythos.com, tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz,\r
- pgsql-hackers@postgresql.org\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-Message-ID: <20020413143539.7818bf7d.nconway@klamath.dyndns.org>\r
-In-Reply-To: <002301c1e2b3$804bd000$0200a8c0@SOL>\r
-References: <GNELIHDDFBOCMGBFGEFOGEBHCCAA.chriskl@familyhealth.com.au>\r
- <3CB52C54.4020507@freaky-namuh.com>\r
- <20020411115434.201ff92f.nconway@klamath.dyndns.org>\r
- <3CB61DAB.5010601@freaky-namuh.com>\r
- <24184.1018581907@sss.pgh.pa.us>\r
- <3CB65B49.93F2F790@tpf.co.jp>\r
- <20020412004134.5d35a2dd.nconway@klamath.dyndns.org>\r
- <20020412095116.B6370@zf.jcu.cz>\r
- <27235.1018620866@sss.pgh.pa.us>\r
- <3CB70E7C.3090801@xythos.com>\r
- <002301c1e2b3$804bd000$0200a8c0@SOL>\r
-X-Mailer: Sylpheed version 0.7.4 (GTK+ 1.2.10; i386-debian-linux-gnu)\r
-MIME-Version: 1.0\r
-Content-Type: text/plain; charset=US-ASCII\r
-Content-Transfer-Encoding: 7bit\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-On Sat, 13 Apr 2002 14:21:50 +0800
-"Christopher Kings-Lynne" <chriskl@familyhealth.com.au> wrote:
-> Could this cache mechanism be used to make views fast as well?
-The current PREPARE/EXECUTE code will speed up queries that use
-rules of any kind, including views: the query plan is cached after
-it has been rewritten as necessary, so (AFAIK) this should mean
-that rules will be evaluated once when the query is PREPAREd, and
-then cached for subsequent EXECUTE commands.
-Neil Conway <neilconway@rogers.com>
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M21309@postgresql.org Sun Apr 14 15:22:44 2002
-Return-path: <pgsql-hackers-owner+M21309@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3EJMiS24239\r
- for <pgman@candle.pha.pa.us>; Sun, 14 Apr 2002 15:22:44 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 44BAC475E05; Sun, 14 Apr 2002 15:22:42 -0400 (EDT)\r
-Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [])\r
- by postgresql.org (Postfix) with ESMTP id 3CD03475925\r
- for <pgsql-hackers@postgresql.org>; Sun, 14 Apr 2002 15:21:58 -0400 (EDT)\r
-Received: from ara.zf.jcu.cz (LOCALHOST [])\r
- by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) with ESMTP id g3EJLiBK012612;\r
- Sun, 14 Apr 2002 21:21:44 +0200\r
-Received: (from zakkr@localhost)\r
- by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) id g3EJLi3k012611;\r
- Sun, 14 Apr 2002 21:21:44 +0200\r
-Date: Sun, 14 Apr 2002 21:21:44 +0200\r
-From: Karel Zak <zakkr@zf.jcu.cz>\r
-To: Tom Lane <tgl@sss.pgh.pa.us>\r
-cc: Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-hackers@postgresql.org,\r
- Neil Conway <nconway@klamath.dyndns.org>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-Message-ID: <20020414212144.A12196@zf.jcu.cz>\r
-References: <200204121621.g3CGL4310492@candle.pha.pa.us> <27964.1018630286@sss.pgh.pa.us>\r
-MIME-Version: 1.0\r
-Content-Type: text/plain; charset=us-ascii\r
-Content-Disposition: inline\r
-User-Agent: Mutt/1.2.5i\r
-In-Reply-To: <27964.1018630286@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Fri, Apr 12, 2002 at 12:51:26PM -0400\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-On Fri, Apr 12, 2002 at 12:51:26PM -0400, Tom Lane wrote:
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > Certainly a shared cache would be good for apps that connect to issue a
-> > single query frequently. In such cases, there would be no local cache
-> > to use.
-> We have enough other problems with the single-query-per-connection
-> scenario that I see no reason to believe that a shared plan cache will
-> help materially. The correct answer for those folks will *always* be
-> to find a way to reuse the connection.
- My query cache was written for 7.0. If some future release uses
- pre-forked backends, and a backend stays alive after its client
- disconnects and waits for a new client, the shared cache is (maybe :-)
- not needed. The current backend fork model is the killer of all possible
- caching.
- We have more caches. I hope persistent backends will help all of them,
- and I'm sure that speed will go up with persistent backends and
- persistent caches without shared memory usage. There I can agree with
- Tom :-)
- Karel
- Karel Zak <zakkr@zf.jcu.cz>
- http://home.zf.jcu.cz/~zakkr/
- C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M21321@postgresql.org Sun Apr 14 20:40:08 2002
-Return-path: <pgsql-hackers-owner+M21321@postgresql.org>\r
-Received: from postgresql.org (postgresql.org [])\r
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3F0e7S29723\r
- for <pgman@candle.pha.pa.us>; Sun, 14 Apr 2002 20:40:07 -0400 (EDT)\r
-Received: from postgresql.org (postgresql.org [])\r
- by postgresql.org (Postfix) with SMTP\r
- id 3B5FB475DC5; Sun, 14 Apr 2002 20:40:03 -0400 (EDT)\r
-Received: from localhost.localdomain (bgp01077650bgs.wanarb01.mi.comcast.net [])\r
- by postgresql.org (Postfix) with ESMTP id 7B1D3474E71\r
- for <pgsql-hackers@postgresql.org>; Sun, 14 Apr 2002 20:39:18 -0400 (EDT)\r
-Received: from localhost (camber@localhost)\r
- by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g3F0cmD10631;\r
- Sun, 14 Apr 2002 20:38:48 -0400\r
-X-Authentication-Warning: localhost.localdomain: camber owned process doing -bs\r
-Date: Sun, 14 Apr 2002 20:38:48 -0400 (EDT)\r
-From: Brian Bruns <camber@ais.org>\r
-X-X-Sender: <camber@localhost.localdomain>\r
-To: Hannu Krosing <hannu@tm.ee>\r
-cc: <pgsql-hackers@postgresql.org>\r
-Subject: Re: [HACKERS] 7.3 schedule\r
-In-Reply-To: <1018704763.1784.1.camel@taru.tm.ee>\r
-Message-ID: <Pine.LNX.4.33.0204142027180.9523-100000@localhost.localdomain>\r
-MIME-Version: 1.0\r
-Content-Type: TEXT/PLAIN; charset=US-ASCII\r
-Precedence: bulk\r
-Sender: pgsql-hackers-owner@postgresql.org\r
-Status: OR\r
-On 13 Apr 2002, Hannu Krosing wrote:
-> On Fri, 2002-04-12 at 03:04, Brian Bruns wrote:
-> > On 11 Apr 2002, Hannu Krosing wrote:
-> >
-> > > IIRC someone started work on modularising the network-related parts with
-> > > a goal of supporting DRDA (DB2 protocol) and others in future.
-> >
-> > That was me, although I've been bogged down lately, and haven't been able
-> > to get back to it.
-> Has any of your modularisation work got into CVS yet ?
-No, Bruce didn't like the way I did certain things, and had some qualms
-about the value of supporting multiple wire protocols IIRC. Plus the
-patch was not really ready for primetime yet.
-I'm hoping to get back to it soon and sync it with the latest CVS, and
-clean up the odds and ends.
-> > DRDA, btw, is not just a DB2 protocol but an opengroup
-> > spec that hopefully will someday be *the* standard on the wire database
-> > protocol. DRDA handles prepare/execute and is completely binary in
-> > representation, among other advantages.
-> What about extensibility - is there some predefined way of adding new
-> types ?
-Not really, there is some ongoing standards activity adding some new
-features. The list of supported types is pretty impressive, anything in
-particular you are looking for?
-> Also, does it handle NOTIFY ?
-I don't know the answer to this. The spec is pretty huge, so it may, but
-I haven't seen it.
-Even if it is supported as a secondary protocol, I believe there is a lot
-of value in having a single database protocol standard (why else would I
-be doing it!). I'm also looking into what it will take to do the same for
-MySQL and Firebird. Hopefully they will be receptive to the idea as well.
-> ----------------
-> Hannu
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
+++ /dev/null
-From goran@kirra.net Mon Dec 20 14:30:54 1999
-Received: from villa.bildbasen.se (villa.bildbasen.se [])
- by candle.pha.pa.us (8.9.0/8.9.0) with SMTP id PAA29058
- for <pgman@candle.pha.pa.us>; Mon, 20 Dec 1999 15:30:17 -0500 (EST)
-Received: (qmail 2485 invoked from network); 20 Dec 1999 20:29:53 -0000
-Received: from a112.dial.kiruna.se (HELO kirra.net) (
- by villa.bildbasen.se with SMTP; 20 Dec 1999 20:29:53 -0000
-Sender: goran
-Message-ID: <385E9192.226CC37D@kirra.net>
-Date: Mon, 20 Dec 1999 21:29:06 +0100
-From: Goran Thyni <goran@kirra.net>
-Organization: kirra.net
-X-Mailer: Mozilla 4.6 [en] (X11; U; Linux 2.2.13 i586)
-X-Accept-Language: sv, en
-MIME-Version: 1.0
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-CC: "neil d. quiogue" <nquiogue@ieee.org>,
- PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] Re: QUESTION: Replication
-References: <199912201508.KAA20572@candle.pha.pa.us>
-Content-Type: text/plain; charset=iso-8859-1
-Content-Transfer-Encoding: 8bit
-Status: OR
-Bruce Momjian wrote:
-> We need major work in this area, or at least a plan and an FAQ item.
-> We are getting major questions on this, and I don't know enough even to
-> make an FAQ item telling people their options.
-My 2 cents, or 2 ören since I'm a Swede, on this:
-It is pretty simple to build replication with pg_dump: dump, transfer,
-empty the replica, and reload.
-But if we want "live replicas" we had better base our efforts on a
-mechanism using WAL logs to roll the replicas forward.
-Göran Thyni
-On quiet nights you can hear Windows NT reboot!
-From owner-pgsql-hackers@hub.org Fri Dec 24 10:01:18 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA11295
- for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 11:01:17 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id KAA20310 for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 10:39:18 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id KAA61760;
- Fri, 24 Dec 1999 10:31:13 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 10:30:48 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id KAA58879
- for pgsql-hackers-outgoing; Fri, 24 Dec 1999 10:29:51 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from bocs170n.black-oak.COM ([])
- by hub.org (8.9.3/8.9.3) with ESMTP id KAA58795
- for <pgsql-hackers@postgreSQL.org>; Fri, 24 Dec 1999 10:29:00 -0500 (EST)
- (envelope-from DWalker@black-oak.com)
-From: DWalker@black-oak.com
-To: pgsql-hackers@postgreSQL.org
-Subject: [HACKERS] database replication
-Date: Fri, 24 Dec 1999 10:27:59 -0500
-Message-ID: <OFD38C9424.B391F434-ON85256851.0054F41A@black-oak.COM>
-X-Priority: 3 (Normal)
-X-MIMETrack: Serialize by Router on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99
- 10:28:01 AM
-MIME-Version: 1.0
-MIME-Version: 1.0
-Content-Type: text/html; charset=ISO-8859-1
-Content-Transfer-Encoding: quoted-printable
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-I've been toying with the idea of implementing database replication for
-the last few days. The system I'm proposing will be a separate program
-which can be run on any machine and will most likely be implemented in
-Python. What I'm looking for at this point are gaping holes in my
-thinking/logic/etc. Here's what I'm thinking...
-1) I want to make this program an additional layer over PostgreSQL. I
-really don't want to hack server code if I can get away with it. At this
-point I don't feel I need to.
-2) The replication system will need to add at least one field to each
-table in each database that needs to be replicated. This field will be a
-date/time stamp which identifies the "last update" of the record. This
-field will be called PGR_TIME for lack of a better name. Because this
-field will be used from within programs and triggers it can be longer so
-as to not mistake it for a user field.
-3) For each table to be replicated the replication system will
-programmatically add one plpgsql function and trigger to modify the
-PGR_TIME field on both UPDATEs and INSERTs. The name of this function and
-trigger will be along the lines of <table_name>_replication_update_trigger
-and <table_name>_replication_update_function. The function is a simple
-two-line chunk of code to set the field PGR_TIME equal to NOW. The
-trigger is called before each insert/update. When looking at the Docs I
-see that times are stored in Zulu (GMT) time. Because of this I don't have
-to worry about time zones and the like. I need direction on this part
-(such as "hey dummy, look at page N of file X.").
-4) At this point we have tables which can, at a basic level, tell the
-replication system when they were last updated.
-5) The replication system will have a database of its own to record the
-last replication event, hold configuration, logs, etc. I'd prefer to
-store the configuration in a PostgreSQL table but it could just as easily
-be stored in a text file on the filesystem somewhere.
-6) To handle replication I basically check the local "last replication
-time" and compare it against the remote PGR_TIME fields. If the remote
-PGR_TIME is greater than the last replication time then change the local
-copy of the database, otherwise, change the remote end of the database.
-At this point I don't have a way to know WHICH field changed between the
-two replicas so either I do ROW level replication or I check each field.
-I check PGR_TIME to determine which field is the most current. Some fine
-tuning of this process will have to occur no doubt.
-7) The commandline utility, fired off by something like cron, could run
-several times during the day -- command line parameters can be
-implemented to say
-Questions/Concerns:
-1) How far do I go with this? Do I start manhandling the system catalogs
-(pg_* tables)?
-2) As to #2 and #3 above, I really don't like tools automagically changing
-my tables but at this point I don't see a way around it. I guess this is
-where the testing comes into play.
-3) Security: the replication app will have to have pretty good rights to
-the database so it can add the necessary functions and triggers, modify
-table schema, etc.
-So, any "you're insane and should run home to momma" comments?
-Damond
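The comparison in step 6 can be illustrated with a small Python sketch. It is one possible reading of the proposal, not the author's implementation: it does row-level, last-writer-wins replication, models each replica as an in-memory dict instead of a database table, and uses plain integers for PGR_TIME. The function name `sync_rows` and its arguments are hypothetical.

```python
def sync_rows(local, remote, last_sync):
    """Row-level, last-writer-wins merge of two replicas.

    local and remote map primary key -> (pgr_time, row). Rows whose newest
    pgr_time is not later than last_sync are left alone. Returns the number
    of rows copied in either direction.
    """
    copied = 0
    for key in set(local) | set(remote):
        mine = local.get(key)
        theirs = remote.get(key)
        times = [entry[0] for entry in (mine, theirs) if entry is not None]
        if max(times) <= last_sync:
            continue  # nothing changed since the last replication run
        if mine is None or (theirs is not None and theirs[0] > mine[0]):
            local[key] = theirs   # remote copy is newer: change local
            copied += 1
        elif theirs is None or mine[0] > theirs[0]:
            remote[key] = mine    # local copy is newer: change remote
            copied += 1
    return copied
```

Note what the sketch does not cover: deleted rows leave no PGR_TIME behind, and rows are copied one at a time with no transaction grouping, so nothing here keeps related tables mutually consistent while changes keep arriving.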
-From owner-pgsql-hackers@hub.org Fri Dec 24 18:31:03 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA26244
- for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 19:31:02 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id TAA12730 for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 19:30:05 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id TAA57851;
- Fri, 24 Dec 1999 19:23:31 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 19:22:54 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id TAA57710
- for pgsql-hackers-outgoing; Fri, 24 Dec 1999 19:21:56 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from Mail.austin.rr.com (sm2.texas.rr.com [])
- by hub.org (8.9.3/8.9.3) with ESMTP id TAA57680
- for <pgsql-hackers@postgresql.org>; Fri, 24 Dec 1999 19:21:25 -0500 (EST)
- (envelope-from ELOEHR@austin.rr.com)
-Received: from austin.rr.com ([]) by Mail.austin.rr.com with Microsoft SMTPSVC(5.5.1877.197.19);
- Fri, 24 Dec 1999 18:12:50 -0600
-Message-ID: <38640E2D.75136600@austin.rr.com>
-Date: Fri, 24 Dec 1999 18:22:05 -0600
-From: Ed Loehr <ELOEHR@austin.rr.com>
-X-Mailer: Mozilla 4.7 [en] (X11; U; Linux 2.2.12-20smp i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: DWalker@black-oak.com
-CC: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] database replication
-References: <OFD38C9424.B391F434-ON85256851.0054F41A@black-oak.COM>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-DWalker@black-oak.com wrote:
-> 6) To handle replication I basically check the local "last
-> replication time" and compare it against the remote PGR_TIME
-> fields. If the remote PGR_TIME is greater than the last replication
-> time then change the local copy of the database, otherwise, change
-> the remote end of the database. At this point I don't have a way to
-> know WHICH field changed between the two replicas so either I do ROW
-> level replication or I check each field. I check PGR_TIME to
-> determine which field is the most current. Some fine tuning of this
-> process will have to occur no doubt.
-Interesting idea. I can see how this might sync up two databases
-somehow. For true replication, however, I would always want every
-replicated database to be, at the very least, internally consistent
-(i.e., referential integrity), even if it was a little behind on
-processing transactions. In this method, it's not clear how
-consistency is ever achieved/guaranteed at any point in time if the
-input stream of changes is continuous. If the input stream ceased,
-then I can see how this approach might eventually catch up and totally
-resync everything, but it looks *very* computationally expensive.
-But I might have missed something. How would internal consistency be
-> 7) The commandline utility, fired off by something like cron, could
-> run several times during the day -- command line parameters can be
-My two cents is that, while I can see this kind of database syncing as
-valuable, this is not the kind of "replication" I had in mind. This
-may already be possible by simply copying the database. What replication
-means to me is a live, continuously streaming sequence of updates from
-one database to another where the replicated database is always
-internally consistent, available for read-only queries, and never "too
-far" out of sync with the source/primary database.
-What does replication mean to others?
-Ed Loehr
-From owner-pgsql-hackers@hub.org Fri Dec 24 21:31:10 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA02578
- for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 22:31:09 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id WAA16641 for <pgman@candle.pha.pa.us>; Fri, 24 Dec 1999 22:18:56 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id WAA89135;
- Fri, 24 Dec 1999 22:11:12 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 22:10:56 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id WAA89019
- for pgsql-hackers-outgoing; Fri, 24 Dec 1999 22:09:59 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from bocs170n.black-oak.COM ([])
- by hub.org (8.9.3/8.9.3) with ESMTP id WAA88957;
- Fri, 24 Dec 1999 22:09:11 -0500 (EST)
- (envelope-from dwalker@black-oak.com)
-Received: from gcx80 ([])
- by bocs170n.black-oak.COM (Lotus Domino Release 5.0.1)
- with SMTP id 1999122422080835:6 ;
- Fri, 24 Dec 1999 22:08:08 -0500
-Message-ID: <001b01bf4e9e$647287d0$af63a8c0@walkers.org>
-From: "Damond Walker" <dwalker@black-oak.com>
-To: <owner-pgsql-hackers@postgreSQL.org>
-Cc: <pgsql-hackers@postgreSQL.org>
-References: <OFD38C9424.B391F434-ON85256851.0054F41A@black-oak.COM> <38640E2D.75136600@austin.rr.com>
-Subject: Re: [HACKERS] database replication
-Date: Fri, 24 Dec 1999 22:07:55 -0800
-MIME-Version: 1.0
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook Express 5.00.2314.1300
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-X-MIMETrack: Itemize by SMTP Server on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99
- 10:08:09 PM,
- Serialize by Router on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99
- 10:08:11 PM,
- Serialize complete at 12/24/99 10:08:11 PM
-Content-Transfer-Encoding: 7bit
-Content-Type: text/plain;
- charset="iso-8859-1"
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-> Interesting idea. I can see how this might sync up two databases
-> somehow. For true replication, however, I would always want every
-> replicated database to be, at the very least, internally consistent
-> (i.e., referential integrity), even if it was a little behind on
-> processing transactions. In this method, its not clear how
-> consistency is every achieved/guaranteed at any point in time if the
-> input stream of changes is continuous. If the input stream ceased,
-> then I can see how this approach might eventually catch up and totally
-> resync everything, but it looks *very* computationally expensive.
- What's the typical unit of work for the database? Are we talking about
-update transactions which span the entire DB? Or are we talking about
-updating maybe 1% or less of the database everyday? I'd think it would be
-more towards the latter than the former. So, yes, this process would be
-computationally expensive but how many records would actually have to be
-sent back and forth?
-> But I might have missed something. How would internal consistency be
-> maintained?
- Updates that occur at site A will be moved to site B and vice versa.
-Consistency would be maintained. The only problem that I can see right off
-the bat would be what if site A and site B made changes to a row and then
-site C was brought into the picture? Which one wins?
- Someone *has* to win when it comes to this type of thing. You really
-DON'T want to start merging row changes...
-> My two cents is that, while I can see this kind of database syncing as
-> valuable, this is not the kind of "replication" I had in mind. This
-> may already possible by simply copying the database. What replication
-> means to me is a live, continuously streaming sequence of updates from
-> one database to another where the replicated database is always
-> internally consistent, available for read-only queries, and never "too
-> far" out of sync with the source/primary database.
- Sounds like you're talking about distributed transactions to me. That's
-an entirely different subject altogether. What you describe can be done
-by copying a database...but as you say, this would only work in a read-only
- Damond
-From owner-pgsql-hackers@hub.org Sat Dec 25 16:35:07 1999
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA28890
- for <pgman@candle.pha.pa.us>; Sat, 25 Dec 1999 17:35:05 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id RAA86997;
- Sat, 25 Dec 1999 17:29:10 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Sat, 25 Dec 1999 17:28:09 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id RAA86863
- for pgsql-hackers-outgoing; Sat, 25 Dec 1999 17:27:11 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from mtiwmhc08.worldnet.att.net (mtiwmhc08.worldnet.att.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id RAA86798
- for <pgsql-hackers@postgreSQL.org>; Sat, 25 Dec 1999 17:26:34 -0500 (EST)
- (envelope-from pgsql@rkirkpat.net)
-Received: from [] ([])
- by mtiwmhc08.worldnet.att.net (InterMail v03.02.07.07 118-134)
- with ESMTP id <19991225222554.VIOL28505@[]>;
- Sat, 25 Dec 1999 22:25:54 +0000
-Date: Sat, 25 Dec 1999 15:25:47 -0700 (MST)
-From: Ryan Kirkpatrick <pgsql@rkirkpat.net>
-X-Sender: rkirkpat@excelsior.rkirkpat.net
-To: DWalker@black-oak.com
-cc: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] database replication
-In-Reply-To: <OFD38C9424.B391F434-ON85256851.0054F41A@black-oak.COM>
-Message-ID: <Pine.LNX.4.10.9912251433310.1551-100000@excelsior.rkirkpat.net>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-On Fri, 24 Dec 1999 DWalker@black-oak.com wrote:
-> I've been toying with the idea of implementing database replication
-> for the last few days.
- I too have been thinking about this some over the last year or
-two, just trying to find a quick and easy way to do it. I am not so much
-interested in replication as in synchronization, such as between a desktop
-machine and a laptop, so I can keep the databases on each in sync with
-each other. For this sort of purpose, both the local and remote databases
-would be "idle" at the time of syncing.
-> 2) The replication system will need to add at least one field to each
-> table in each database that needs to be replicated. This field will be
-> a date/time stamp which identifies the "last update" of the record.
-> This field will be called PGR_TIME for lack of a better name.
-> Because this field will be used from within programs and triggers it
-> can be longer so as to not mistake it for a user field.
 How about a single, separate table with the fields 'database',
-'tablename', 'oid', 'last_changed', that would store the same data as your
-PGR_TIME field? It would be separate from the actual data tables, and
-therefore would be totally transparent to any database interface
-applications. The 'oid' field would hold each row's OID, a nice, unique
-identification number for the row, while the other fields would tell which
-table and database the oid is in. Then this table can be compared with the
-same table on a remote machine to quickly find updates and changes, and
-each difference can be dealt with in turn.
-> 3) For each table to be replicated the replication system will
-> programatically add one plpgsql function and trigger to modify the
-> PGR_TIME field on both UPDATEs and INSERTs. The name of this function
-> and trigger will be along the lines of
-> <table_name>_replication_update_trigger and
-> <table_name>_replication_update_function. The function is a simple
-> two-line chunk of code to set the field PGR_TIME equal to NOW. The
-> trigger is called before each insert/update. When looking at the Docs
-> I see that times are stored in Zulu (GT) time. Because of this I
-> don't have to worry about time zones and the like. I need direction
-> on this part (such as "hey dummy, look at page N of file X.").
- I like this idea, better than any I have come up with yet. Though,
-how are you going to handle DELETEs?
-> 6) To handle replication I basically check the local "last replication
-> time" and compare it against the remote PGR_TIME fields. If the
-> remote PGR_TIME is greater than the last replication time then change
-> the local copy of the database, otherwise, change the remote end of
-> the database. At this point I don't have a way to know WHICH field
-> changed between the two replicas so either I do ROW level replication
-> or I check each field. I check PGR_TIME to determine which field is
-> the most current. Some fine tuning of this process will have to occur
-> no doubt.
- Yea, this is indeed the sticky part, and would indeed require some
-fine-tuning. Basically, the way I see it, is if the two timestamps for a
-single row do not match (or even if the row and therefore timestamp is
-missing on one side or the other altogether):
- local ts > remote ts => Local row is exported to remote.
- remote ts > local ts => Remote row is exported to local.
- local ts > last sync time && no remote ts =>
- Local row is inserted on remote.
- local ts < last sync time && no remote ts =>
- Local row is deleted.
- remote ts > last sync time && no local ts =>
- Remote row is inserted on local.
- remote ts < last sync time && no local ts =>
- Remote row is deleted.
-where the synchronization process is running on the local machine. By
-exported, I mean the local values are sent to the remote machine, and the
-row on that remote machine is updated to the local values. How does this
-> 7) The commandline utility, fired off by something like cron, could
-> run several times during the day -- command line parameters can be
- Or run manually for my purposes. Also, maybe follow it
-with a vacuum run on both sides for all databases, as this is going to
-potentially cause lots of table changes that could stand with a cleanup.
-> 1) How far do I go with this? Do I start manhandling the system catalogs (pg_* tables)?
- Initially, I would just stick to user table data... If you have
-changes in triggers and other meta-data/executable code, you are going to
-want to make syncs of that stuff manually anyway. At least I would want
-> 2) As to #2 and #3 above, I really don't like tools automagically
-> changing my tables but at this point I don't see a way around it. I
-> guess this is where the testing comes into play.
 Hence the reason for the separate table with just a row's
-identification and last update time. The only modification to the synced
-database is the update trigger, which should be pretty harmless.
-> 3) Security: the replication app will have to have pretty good rights
-> to the database so it can add the nessecary functions and triggers,
-> modify table schema, etc.
- Just run the sync program as the postgres super user, and there
-are no problems. :)
-> So, any "you're insane and should run home to momma" comments?
 No, not at all. Though it probably should be renamed from
-replication to synchronization. The former is usually associated with a
-continuous stream of updates between the local and remote databases, so
-they are almost always in sync, and have a queuing ability if their
-connection is lost for a span of time as well. Very complex and difficult to
-implement, and would require hacking server code. :( Something only Sybase
-and Oracle have (as far as I know), and from what I have seen of Sybase's
-replication server support (dated by 5yrs) it was a pain to set up and get
-running correctly.
 The latter, synchronization, is much more manageable, and can still
-be useful, especially when you have a large database you want in two
-places, mainly for read only purposes at one end or the other, but don't
-want to waste the time/bandwidth to move and load the entire database each
-time it changes on one end or the other. Same idea as mirroring software
-for FTP sites, just transfers the changes, and nothing more.
- I also like the idea of using Python. I have been using it
-recently for some database interfaces (to PostgreSQL of course :), and it
-is a very nice language to work with. Some worries about performance of
-the program though, as Python is only an interpreted language, and I have
-yet to really be impressed with the speed of execution of my database
-interfaces.
 Anyway, it sounds like a good project, and finally one where I
-actually have a clue of what is going on, and the skills to help. So, if
-you are interested in pursuing this project, I would be more than glad to
-help. TTYL.
-| "For to me to live is Christ, and to die is gain." |
-| --- Philippians 1:21 (KJV) |
-| Ryan Kirkpatrick | Boulder, Colorado | http://www.rkirkpat.net/ |
-From owner-pgsql-hackers@hub.org Sun Dec 26 08:31:09 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA17976
- for <pgman@candle.pha.pa.us>; Sun, 26 Dec 1999 09:31:07 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id JAA23337 for <pgman@candle.pha.pa.us>; Sun, 26 Dec 1999 09:28:36 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id JAA90738;
- Sun, 26 Dec 1999 09:21:58 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Sun, 26 Dec 1999 09:19:19 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id JAA90498
- for pgsql-hackers-outgoing; Sun, 26 Dec 1999 09:18:21 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from bocs170n.black-oak.COM ([])
- by hub.org (8.9.3/8.9.3) with ESMTP id JAA90452
- for <pgsql-hackers@postgreSQL.org>; Sun, 26 Dec 1999 09:17:54 -0500 (EST)
- (envelope-from dwalker@black-oak.com)
-Received: from vmware98 ([])
- by bocs170n.black-oak.COM (Lotus Domino Release 5.0.1)
- with SMTP id 1999122609164808:7 ;
- Sun, 26 Dec 1999 09:16:48 -0500
-Message-ID: <002201bf4fb3$623f0220$b263a8c0@vmware98.walkers.org>
-From: "Damond Walker" <dwalker@black-oak.com>
-To: "Ryan Kirkpatrick" <pgsql@rkirkpat.net>
-Cc: <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] database replication
-Date: Sun, 26 Dec 1999 10:10:41 -0500
-MIME-Version: 1.0
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook Express 4.72.3110.1
-X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
-X-MIMETrack: Itemize by SMTP Server on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/26/99
- 09:16:51 AM,
- Serialize by Router on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/26/99
- 09:16:54 AM,
- Serialize complete at 12/26/99 09:16:54 AM
-Content-Transfer-Encoding: 7bit
-Content-Type: text/plain;
- charset="iso-8859-1"
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-> I too have been thinking about this some over the last year or
->two, just trying to find a quick and easy way to do it. I am not so
->interested in replication, as in synchronization, as in between a desktop
->machine and a laptop, so I can keep the databases on each in sync with
->each other. For this sort of purpose, both the local and remote databases
->would be "idle" at the time of syncing.
- I don't think it would matter if the databases are idle or not to be
-honest with you. At any single point in time when you replicate I'd figure
-that the database would be in a consistent state. So, you should be able to
-replicate (or sync) a remote database that is in use. After all, you're
-getting a snapshot of the database as it stands at 8:45 PM. At 8:46 PM it
-may be totally different...but the next time syncing takes place those
-changes would appear in your local copy.
- The one problem you may run into is if the remote host is running a
-large batch process. It's very likely that you will get 50% of their
-changes when you replicate...but then again, that's why you can schedule the
-event to work around such things.
-> How about a single, seperate table with the fields of 'database',
->'tablename', 'oid', 'last_changed', that would store the same data as your
->PGR_TIME field. It would be seperated from the actually data tables, and
->therefore would be totally transparent to any database interface
->applications. The 'oid' field would hold each row's OID, a nice, unique
->identification number for the row, while the other fields would tell which
->table and database the oid is in. Then this table can be compared with the
->this table on a remote machine to quickly find updates and changes, then
->each differences can be dealt with in turn.
- The problem with OID's is that they are unique at the local level but if
-you try and use them between servers you can run into overlap. Also, if a
-database is under heavy use this table could quickly become VERY large. Add
-indexes to this table to help performance and you're taking up even more
-disk space.
 Using the PGR_TIME field with an index will allow us to find rows which
-have changed VERY quickly. All we need to do now is somehow programmatically
-find the primary key for a table so the person setting up replication (or
-syncing) doesn't have to have an in-depth knowledge of the schema in order to
-set up a syncing schedule.
-> I like this idea, better than any I have come up with yet. Though,
->how are you going to handle DELETEs?
- Oops...how about defining a trigger for this? With deletion I guess we
-would have to move a flag into another table saying we deleted record 'X'
-with this primary key from this table.
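A minimal sketch of that delete-flag idea, under loud assumptions: it uses
Python's built-in sqlite3 module as a stand-in for PostgreSQL (so the trigger
syntax is SQLite's, not plpgsql), and the names customer, pgr_deleted and
customer_replication_delete_trigger are hypothetical, chosen only for
illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a PostgreSQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical side table: one row per deleted record, keyed by the
-- source table name and the primary key of the row that was removed.
CREATE TABLE pgr_deleted (
    table_name TEXT NOT NULL,
    pkey       TEXT NOT NULL,
    deleted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Record a flag whenever a row is deleted from the replicated table.
CREATE TRIGGER customer_replication_delete_trigger
AFTER DELETE ON customer
BEGIN
    INSERT INTO pgr_deleted (table_name, pkey) VALUES ('customer', OLD.id);
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'alice')")
conn.execute("INSERT INTO customer VALUES (2, 'bob')")
conn.execute("DELETE FROM customer WHERE id = 1")

# The sync run could read these flags and replay the deletes remotely.
tombstones = conn.execute(
    "SELECT table_name, pkey FROM pgr_deleted"
).fetchall()
remaining = conn.execute("SELECT id FROM customer").fetchall()
```

A sync run would presumably clear the flags it has already replayed, so the
side table does not grow without bound.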
-> Yea, this is indeed the sticky part, and would indeed require some
->fine-tunning. Basically, the way I see it, is if the two timestamps for a
->single row do not match (or even if the row and therefore timestamp is
->missing on one side or the other altogether):
-> local ts > remote ts => Local row is exported to remote.
-> remote ts > local ts => Remote row is exported to local.
-> local ts > last sync time && no remote ts =>
-> Local row is inserted on remote.
-> local ts < last sync time && no remote ts =>
-> Local row is deleted.
-> remote ts > last sync time && no local ts =>
-> Remote row is inserted on local.
-> remote ts < last sync time && no local ts =>
-> Remote row is deleted.
->where the synchronization process is running on the local machine. By
->exported, I mean the local values are sent to the remote machine, and the
->row on that remote machine is updated to the local values. How does this
- The replication part will be the most complex...that much is for
- I've been writing systems in Lotus Notes/Domino for the last year or so
-and I've grown quite spoiled with what it can do in regards to replication.
-It's not real-time but you have to gear your applications to this type of
-thing (it's possible to create documents, fire off email to notify people of
-changes and have the email arrive before the replicated documents do).
-Replicating large Notes/Domino databases takes quite a while....I don't see
-any kind of replication or syncing running in a blink of an eye.
- Having said that, a good algo will have to be written to cut down on
-network traffic and to keep database conversations down to a minimum. This
-will be appreciated by people with low bandwidth connections I'm sure
-(dial-ups, fractional T1's, etc).
-> Or run manually for my purposes. Also, maybe follow it
->with a vacuum run on both sides for all databases, as this is going to
->potenitally cause lots of table changes that could stand with a cleanup.
- What would a vacuum do to a system being used by many people?
-> No, not at all. Though it probably should be remaned from
->replication to synchronization. The former is usually associated with a
->continuous stream of updates between the local and remote databases, so
->they are almost always in sync, and have a queuing ability if their
->connection is loss for span of time as well. Very complex and difficult to
->implement, and would require hacking server code. :( Something only Sybase
->and Oracle have (as far as I know), and from what I have seen of Sybase's
->replication server support (dated by 5yrs) it was a pain to setup and get
->running correctly.
- It could probably be named either way...but the one thing I really don't
-want to do is start hacking server code. The PostgreSQL people have enough
-to do without worrying about trying to meld anything I've done to their
-server. :)
- Besides, I like the idea of having it operate as a stand-alone product.
-The only PostgreSQL feature we would require would be triggers and
-plpgsql...what was the earliest version of PostgreSQL that supported
-plpgsql? Even then I don't see the triggers being that complex to boot.
-> I also like the idea of using Python. I have been using it
->recently for some database interfaces (to PostgreSQL of course :), and it
->is a very nice language to work with. Some worries about performance of
->the program though, as python is only an interpreted lanuage, and I have
->yet to really be impressed with the speed of execution of my database
->interfaces yet.
- The only thing we'd need for Python is the Python extensions for
-PostgreSQL...which in turn requires libpq and that's about it. So, it
-should be able to run on any platform supported by Python and libpq. Using
-TK for the interface components will require NT people to get additional
-software from the 'net. At least it did with older versions of Windows
-Python. Unix folks should be happy....assuming they have X running on the
-machine doing the replication or syncing. Even then I wrote a curses based
-Python interface awhile back which allows buttons, progress bars, input
-fields, etc (I called it tinter and it's available at
-http://iximd.com/~dwalker). It's a simple interface and could probably be
-cleaned up a bit but it works. :)
-> Anyway, it sound like a good project, and finally one where I
->actually have a clue of what is going on, and the skills to help. So, if
->you are interested in pursing this project, I would be more than glad to
->help. TTYL.
- That would be a Good Thing. Have webspace somewhere? If I can get
-permission from the "powers that be" at the office I could host a website on
-our (Domino) webserver.
- Damond
-From owner-pgsql-hackers@hub.org Sun Dec 26 19:11:48 1999
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA26661
- for <pgman@candle.pha.pa.us>; Sun, 26 Dec 1999 20:11:46 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id UAA14959;
- Sun, 26 Dec 1999 20:08:15 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Sun, 26 Dec 1999 20:07:27 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id UAA14820
- for pgsql-hackers-outgoing; Sun, 26 Dec 1999 20:06:28 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from mtiwmhc02.worldnet.att.net (mtiwmhc02.worldnet.att.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id UAA14749
- for <pgsql-hackers@postgreSQL.org>; Sun, 26 Dec 1999 20:05:39 -0500 (EST)
- (envelope-from rkirkpat@rkirkpat.net)
-Received: from [] ([])
- by mtiwmhc02.worldnet.att.net (InterMail v03.02.07.07 118-134)
- with ESMTP id <19991227010506.WJVW1914@[]>;
- Mon, 27 Dec 1999 01:05:06 +0000
-Date: Sun, 26 Dec 1999 18:05:02 -0700 (MST)
-From: Ryan Kirkpatrick <pgsql@rkirkpat.net>
-X-Sender: rkirkpat@excelsior.rkirkpat.net
-To: Damond Walker <dwalker@black-oak.com>
-cc: pgsql-hackers@postgreSQL.org
-Subject: Re: [HACKERS] database replication
-In-Reply-To: <002201bf4fb3$623f0220$b263a8c0@vmware98.walkers.org>
-Message-ID: <Pine.LNX.4.10.9912261742550.7666-100000@excelsior.rkirkpat.net>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-On Sun, 26 Dec 1999, Damond Walker wrote:
-> > How about a single, seperate table with the fields of 'database',
-> >'tablename', 'oid', 'last_changed', that would store the same data as your
-> >PGR_TIME field. It would be seperated from the actually data tables, and
-> The problem with OID's is that they are unique at the local level but if
-> you try and use them between servers you can run into overlap.
- Yea, forgot about that point, but became dead obvious once you
-mentioned it. Boy, I feel stupid now. :)
-> Using the PGR_TIME field with an index will allow us to find rows which
-> have changed VERY quickly. All we need to do now is somehow programatically
-> find the primary key for a table so the person setting up replication (or
-> syncing) doesn't have to have an indepth knowledge of the schema in order to
-> setup a syncing schedule.
- Hmm... Yea, maybe look to see which field(s) has a primary, unique
-index on it? Then use those field(s) as a primary key. Just require that
-any table to be synchronized have some set of fields that uniquely
-identify each row. Either that, or add another field to each table with
-our own, cross system consistent, identification system. Don't know which
-would be more efficient and easier to work with.
 The former could potentially get sticky if it takes a lot of
-fields to generate a unique key value, but has the smallest effect on the
-table to be synced. The latter could be difficult to keep straight between
-systems (local vs. remote), and would require a trigger on inserts to
-generate a new, unique id number, that does not exist locally or
-remotely (nasty issue there), but would remove the uniqueness
-> Oops...how about defining a trigger for this? With deletion I guess we
-> would have to move a flag into another table saying we deleted record 'X'
-> with this primary key from this table.
- Or, according to my logic below, if a row is missing on one side
-or the other, then just compare the remaining row's timestamp to the last
-synchronization time (stored in a separate table/db elsewhere). The
-results of the comparison and the state of row existence tell one if the
-row was inserted or deleted since the last sync, and what should be done
-to perform the sync.
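The comparison described here can be sketched as a small Python function. This
is only an illustration of the rules in this thread, not an implementation
anyone has written: the function name sync_action and the action strings are
hypothetical, None stands for "row missing on that side", and treating a
timestamp exactly equal to the last sync time as "older" is an assumption the
thread does not settle.

```python
from datetime import datetime
from typing import Optional


def sync_action(local_ts: Optional[datetime],
                remote_ts: Optional[datetime],
                last_sync: datetime) -> str:
    """Pick the action for one row; None means the row is absent."""
    if local_ts is not None and remote_ts is not None:
        if local_ts > remote_ts:
            return "export_local_to_remote"
        if remote_ts > local_ts:
            return "export_remote_to_local"
        return "in_sync"
    if local_ts is not None:
        # Row exists only locally: new since the last sync, or deleted
        # remotely. Equality is treated as "older" (an assumption).
        if local_ts > last_sync:
            return "insert_on_remote"
        return "delete_local"
    if remote_ts is not None:
        # Row exists only remotely: the mirror image of the case above.
        if remote_ts > last_sync:
            return "insert_on_local"
        return "delete_remote"
    return "in_sync"  # missing on both sides, nothing to do


last_sync = datetime(1999, 12, 26, 12, 0)
before = datetime(1999, 12, 25, 12, 0)
after = datetime(1999, 12, 27, 12, 0)
```

Note that the result depends on both machines having comparable clocks; the
sketch simply assumes they do.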
-> > Yea, this is indeed the sticky part, and would indeed require some
-> >fine-tunning. Basically, the way I see it, is if the two timestamps for a
-> >single row do not match (or even if the row and therefore timestamp is
-> >missing on one side or the other altogether):
-> > local ts > remote ts => Local row is exported to remote.
-> > remote ts > local ts => Remote row is exported to local.
-> > local ts > last sync time && no remote ts =>
-> > Local row is inserted on remote.
-> > local ts < last sync time && no remote ts =>
-> > Local row is deleted.
-> > remote ts > last sync time && no local ts =>
-> > Remote row is inserted on local.
-> > remote ts < last sync time && no local ts =>
-> > Remote row is deleted.
-> >where the synchronization process is running on the local machine. By
-> >exported, I mean the local values are sent to the remote machine, and the
-> >row on that remote machine is updated to the local values. How does this
-> >sound?
-> Having said that, a good algo will have to be written to cut down on
-> network traffic and to keep database conversations down to a minimum. This
-> will be appreciated by people with low bandwidth connections I'm sure
-> (dial-ups, fractional T1's, etc).
- Of course! In reflection, the assigned identification number I
-mentioned above might be the best then, instead of having to transfer the
-entire set of key fields back and forth.
-> What would a vacuum do to a system being used by many people?
- Probably lock them out of tables while they are vacuumed... Maybe
-not really required in the end, possibly optional?
-> It could probably be named either way...but the one thing I really don't
-> want to do is start hacking server code. The PostgreSQL people have enough
-> to do without worrying about trying to meld anything I've done to their
-> server. :)
- Yea, they probably would appreciate that. They already have enough
-on their plate for 7.x as it is! :)
-> Besides, I like the idea of having it operate as a stand-alone product.
-> The only PostgreSQL feature we would require would be triggers and
-> plpgsql...what was the earliest version of PostgreSQL that supported
-> plpgsql? Even then I don't see the triggers being that complex to boot.
- No, provided that we don't do the identification number idea
-(which the more I think about it, probably will not work). As for what
-versions support plpgsql, I don't know; one of the more hard-core pgsql
-hackers can probably tell us that.
-> The only thing we'd need for Python is the Python extensions for
-> PostgreSQL...which in turn requires libpq and that's about it. So, it
-> should be able to run on any platform supported by Python and libpq.
- Of course. If it ran on NT as well as Linux/Unix, that would be
-even better. :)
-> Unix folks should be happy....assuming they have X running on the
-> machine doing the replication or syncing. Even then I wrote a curses
-> based Python interface awhile back which allows buttons, progress
-> bars, input fields, etc (I called it tinter and it's available at
-> http://iximd.com/~dwalker). It's a simple interface and could
-> probably be cleaned up a bit but it works. :)
- Why would we want any type of GUI (X11 or curses) for this sync
-program? I imagine just a command line program with a few options (local
-machine, remote machine, db name, etc...), and nothing else.
- Though I will take a look at your curses interface, as I have been
-wanting to make a curses interface to a few db interfaces I have, in as
-simple a manner as possible.
-> That would be a Good Thing. Have webspace somewhere? If I can get
-> permission from the "powers that be" at the office I could host a website on
-> our (Domino) webserver.
- Yea, I got my own web server (www.rkirkpat.net) with 1GB+ of disk
-space available, sitting on a decent speed DSL. I can even set up a
-virtual server if we want (i.e. pgsync.rkirkpat.net :). CVS repository,
-email lists, etc... possible with some effort (and time).
- So, where should we start? TTYL.
- PS. The current pages on my web site are very out of date at the
-moment (save for the pgsql information). I hope to have updated ones up
-within the week.
-| "For to me to live is Christ, and to die is gain." |
-| --- Philippians 1:21 (KJV) |
-| Ryan Kirkpatrick | Boulder, Colorado | http://www.rkirkpat.net/ |
-From owner-pgsql-hackers@hub.org Mon Dec 27 12:33:32 1999
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA24817
- for <pgman@candle.pha.pa.us>; Mon, 27 Dec 1999 13:33:29 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id NAA53391;
- Mon, 27 Dec 1999 13:29:02 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Mon, 27 Dec 1999 13:28:38 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id NAA53248
- for pgsql-hackers-outgoing; Mon, 27 Dec 1999 13:27:40 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from gtv.ca (h139-142-238-17.cg.fiberone.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id NAA53170
- for <pgsql-hackers@hub.org>; Mon, 27 Dec 1999 13:26:40 -0500 (EST)
- (envelope-from aaron@genisys.ca)
-Received: from stilborne ( [])
- by gtv.ca (8.9.3/8.8.7) with SMTP id MAA01200
- for <pgsql-hackers@hub.org>; Mon, 27 Dec 1999 12:36:39 -0700
-From: "Aaron J. Seigo" <aaron@gtv.ca>
-To: pgsql-hackers@hub.org
-Subject: Re: [HACKERS] database replication
-Date: Mon, 27 Dec 1999 11:23:19 -0700
-X-Mailer: KMail [version 1.0.28]
-Content-Type: text/plain
-References: <199912271135.TAA10184@netrinsics.com>
-In-Reply-To: <199912271135.TAA10184@netrinsics.com>
-MIME-Version: 1.0
-Message-Id: <99122711245600.07929@stilborne>
-Content-Transfer-Encoding: 8bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-> Before anyone starts implementing any database replication, I'd strongly
-> suggest doing some research, first:
-> http://sybooks.sybase.com:80/onlinebooks/group-rs/rsg1150e/rs_admin/@Generic__BookView;cs=default;ts=default
-good idea, but perhaps sybase isn't the best study case.. here's some extremely
-detailed online coverage of Oracle 8i's replication, from the oracle online
-Aaron J. Seigo
-Sys Admin
-From owner-pgsql-hackers@hub.org Thu Dec 30 08:01:09 1999
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA10317
- for <pgman@candle.pha.pa.us>; Thu, 30 Dec 1999 09:01:08 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id IAA02365 for <pgman@candle.pha.pa.us>; Thu, 30 Dec 1999 08:37:10 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id IAA87902;
- Thu, 30 Dec 1999 08:34:22 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Thu, 30 Dec 1999 08:32:24 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id IAA85771
- for pgsql-hackers-outgoing; Thu, 30 Dec 1999 08:31:27 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sandman.acadiau.ca (dcurrie@sandman.acadiau.ca [])
- by hub.org (8.9.3/8.9.3) with ESMTP id IAA85234
- for <pgsql-hackers@postgresql.org>; Thu, 30 Dec 1999 08:31:10 -0500 (EST)
- (envelope-from dcurrie@sandman.acadiau.ca)
-Received: (from dcurrie@localhost)
- by sandman.acadiau.ca (8.8.8/8.8.8/Debian/GNU) id GAA18698;
- Thu, 30 Dec 1999 06:30:58 -0400
-From: Duane Currie <dcurrie@sandman.acadiau.ca>
-Message-Id: <199912301030.GAA18698@sandman.acadiau.ca>
-Subject: Re: [HACKERS] database replication
-In-Reply-To: <OFD38C9424.B391F434-ON85256851.0054F41A@black-oak.COM> from "DWalker@black-oak.com" at "Dec 24, 99 10:27:59 am"
-To: DWalker@black-oak.com
-Date: Thu, 30 Dec 1999 10:30:58 +0000 (AST)
-Cc: pgsql-hackers@postgresql.org
-X-Mailer: ELM [version 2.4ME+ PL39 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgresql.org
-Status: OR
-Hi Guys,
-Now for one of my REALLY rare posts.
-Having done a little bit of distributed data systems, I figured I'd
-pitch in a couple cents worth.
-> 2) The replication system will need to add at least one field to each
-> table in each database that needs to be replicated. This
-> field will be a date/time stamp which identifies the "last
-> update" of the record. This field will be called PGR_TIME
-> for lack of a better name. Because this field will be used
-> from within programs and triggers it can be longer so as to not
-> mistake it for a user field.
-I just started reading this thread, but I figured I'd throw in a couple
-suggestions for distributed data control (a few idioms I've had to
-deal with before):
- - Never use time (not reliable from system to system). Use
- a version number of some sort that can stay consistent across
- all replicas
- This way, if a system's time is or goes out of whack, it doesn't
- cause your database to disintegrate, and it's easier to track
- conflicts (see below. If using time, the algorithm gets
- nightmarish)
- - On an insert, set to version 1
- - On an update, version++
- - On a delete, mark deleted, and add a delete stub somewhere for the
- replicator process to deal with in sync'ing the databases.
- - If two records have the same version but different data, there's
- a conflict. A few choices:
- 1. Pick one as the correct one (yuck!! invisible data loss)
- 2. Store both copies, pick one as current, and alert
- database owner of the conflict, so they can deal with
- it "manually."
- 3. If possible, some conflicts can be merged. If a disjoint
- set of fields were changed in each instance, these changes
- may both be applied and the record merged. (Problem:
- takes a lot more space. Requires a version number for
- every field, or persistent storage of some old records.
- However, this might help the "which fields changed" issue
- you were talking about in #6)
- - A unique id across all systems should exist (or something that
- effectively simulates a unique id). Maybe a composition of the
- originating oid (from the insert) and the originating database
- (oid of the database's record?) might do it. Store this as
- an extra field in every record.
- (Two extra fields so far: 'unique id' and 'version')
-I do like your approach: triggers and a separate process. (Maintainable!! :)
-Anyway, just figured I'd throw in a few suggestions,
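The version-number rules in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` type, the `"<database>:<oid>"` form of the unique id, and the `reconcile` function are hypothetical names, not part of any PostgreSQL code. It covers the "higher version wins" case and the "same version, different data" conflict; merging disjoint field changes is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    uid: str            # hypothetical global id, e.g. "<origin database>:<origin oid>"
    version: int        # 1 on insert, incremented on every update
    data: dict
    deleted: bool = False   # stands in for the "delete stub"


def reconcile(local, remote):
    """Return (winner, conflict) for two replicas of the same row."""
    if local.uid != remote.uid:
        raise ValueError("records do not describe the same row")
    if local.version > remote.version:
        return local, False
    if remote.version > local.version:
        return remote, False
    # Same version: identical content is fine, different content is a conflict
    # that should be reported to the database owner rather than silently resolved.
    same = local.data == remote.data and local.deleted == remote.deleted
    return local, not same
```

When `conflict` is true, the returned winner is only a provisional "current" copy; both copies would need to be kept for manual resolution, as option 2 in the list suggests.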
-From owner-pgsql-patches@hub.org Sun Jan 2 23:01:38 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA16274
- for <pgman@candle.pha.pa.us>; Mon, 3 Jan 2000 00:01:28 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id XAA02655 for <pgman@candle.pha.pa.us>; Sun, 2 Jan 2000 23:45:55 -0500 (EST)
-Received: from hub.org (hub.org [])
- by hub.org (8.9.3/8.9.3) with ESMTP id XAA13828;
- Sun, 2 Jan 2000 23:40:47 -0500 (EST)
- (envelope-from owner-pgsql-patches@hub.org)
-Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 02 Jan 2000 23:38:34 +0000 (EST)
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id XAA13624
- for pgsql-patches-outgoing; Sun, 2 Jan 2000 23:37:36 -0500 (EST)
- (envelope-from owner-pgsql-patches@postgreSQL.org)
-Received: from falla.videotron.net (falla.videotron.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id XAA13560
- for <pgsql-patches@postgresql.org>; Sun, 2 Jan 2000 23:37:02 -0500 (EST)
- (envelope-from P.Marchesso@Videotron.ca)
-Received: from Videotron.ca ([])
- by falla.videotron.net (Sun Internet Mail Server sims.3.5.1999.
- with ESMTP id <0FNQ000TEST8VI@falla.videotron.net> for pgsql-patches@postgresql.org; Sun,
- 2 Jan 2000 23:37:01 -0500 (EST)
-Date: Sun, 02 Jan 2000 23:39:23 -0500
-From: Philippe Marchesseault <P.Marchesso@Videotron.ca>
-Subject: [PATCHES] Distributed PostgreSQL!
-To: pgsql-patches@postgreSQL.org
-Message-id: <387027FB.EB88D757@Videotron.ca>
-MIME-version: 1.0
-X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.11 i586)
-Content-type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_GeYGc69fE1/bkYLTPwOGFg)"
-X-Accept-Language: en
-Sender: owner-pgsql-patches@postgreSQL.org
-Precedence: bulk
-Status: ORr
-This is a multi-part message in MIME format.
-Content-type: text/plain; charset=us-ascii
-Content-transfer-encoding: 7bit
-Hi all!
-Here is a small patch to make postgres a distributed database. By
-distributed I mean that you can have the same copy of the database on N
-different machines and keep them all in sync.
-It does not improve performance unless you distribute your clients in a
-sensible manner. It does not allow you to do parallel selects.
-The support page is : pages.infinit.net/daemon and soon to be in
-The patch was tested with RedHat Linux 6.0 on Intel with kernel 2.2.11.
-Only two machines were used, so I'm not completely sure that it works
-with more than two. -But it should-
-I would like to know if somebody else is interested in this; otherwise
-I'm probably not gonna keep it growing. So please reply to my e-mail
-(P.Marchesso@videotron.ca) to give me an idea of the number of people
-interested in this.
-Thanks all.
-Philippe Marchesseault
-It's not the size of the dog in the fight,
-but the size of the fight in the dog.
- -Archie Griffen
-Content-type: application/octet-stream; name=replicator-0.1.tgz
-Content-disposition: attachment; filename=replicator-0.1.tgz
-Content-transfer-encoding: base64
-From owner-pgsql-hackers@hub.org Mon Jan 3 13:47:07 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA23987
- for <pgman@candle.pha.pa.us>; Mon, 3 Jan 2000 14:47:06 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id OAA03234;
- Mon, 3 Jan 2000 14:39:56 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Mon, 3 Jan 2000 14:39:49 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id OAA03050
- for pgsql-hackers-outgoing; Mon, 3 Jan 2000 14:38:50 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from ara.zf.jcu.cz (zakkr@ara.zf.jcu.cz [])
- by hub.org (8.9.3/8.9.3) with ESMTP id OAA02975
- for <pgsql-hackers@postgreSQL.org>; Mon, 3 Jan 2000 14:38:05 -0500 (EST)
- (envelope-from zakkr@zf.jcu.cz)
-Received: from localhost (zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian/GNU) with SMTP id UAA19297;
- Mon, 3 Jan 2000 20:23:35 +0100
-Date: Mon, 3 Jan 2000 20:23:35 +0100 (CET)
-From: Karel Zak - Zakkr <zakkr@zf.jcu.cz>
-To: P.Marchesso@videotron.ca
-cc: pgsql-hackers <pgsql-hackers@postgresql.org>
-Subject: [HACKERS] replicator
-Message-ID: <Pine.LNX.3.96.1000103194931.19115A-100000@ara.zf.jcu.cz>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Sender: owner-pgsql-hackers@postgresql.org
-Status: OR
-I looked at your (Philippe's) replicator, but I don't fully understand
-your replication concept.
- node1: SQL --IPC--> node-broker
- |
- |
- master-node --IPC--> replikator
- | | |
- libpq
- | | |
- node2 node..n
-(Is this the right picture?)
-If I understand correctly, all nodes connect to the master node, and the
-"replicator" on this master node replicates the data. But the master node is
-a single point of failure in this concept: if the master node is down,
-replication for *all* nodes is lost. Hmm.. but I want to use replication for high available
-IMHO there is a problem with node registration / authentication on the master node.
-Why is the concept not more direct? As:
- SQL --IPC--> node-replicator
- | | |
- via libpq send data to all nodes with
- current client/backend auth.
- (not exist any master node, all nodes have connection to all nodes)
-Using the replicator as an external process and copying data from SQL to this
-replicator via IPC is a very good idea of yours.
- Karel
-Karel Zak <zakkr@zf.jcu.cz> http://home.zf.jcu.cz/~zakkr/
-Docs: http://docs.linux.cz (big docs archive)
-Kim Project: http://home.zf.jcu.cz/~zakkr/kim/ (process manager)
-FTP: ftp://ftp2.zf.jcu.cz/users/zakkr/ (C/ncurses/PgSQL)
-From owner-pgsql-hackers@hub.org Tue Jan 4 10:31:01 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA17522
- for <pgman@candle.pha.pa.us>; Tue, 4 Jan 2000 11:31:00 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id LAA01541 for <pgman@candle.pha.pa.us>; Tue, 4 Jan 2000 11:27:30 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id LAA09992;
- Tue, 4 Jan 2000 11:18:07 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Tue, 4 Jan 2000 11:17:58 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id LAA09856
- for pgsql-hackers-outgoing; Tue, 4 Jan 2000 11:17:17 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from ara.zf.jcu.cz (zakkr@ara.zf.jcu.cz [])
- by hub.org (8.9.3/8.9.3) with ESMTP id LAA09763
- for <pgsql-hackers@postgreSQL.org>; Tue, 4 Jan 2000 11:16:43 -0500 (EST)
- (envelope-from zakkr@zf.jcu.cz)
-Received: from localhost (zakkr@localhost)
- by ara.zf.jcu.cz (8.9.3/8.9.3/Debian/GNU) with SMTP id RAA31673;
- Tue, 4 Jan 2000 17:02:06 +0100
-Date: Tue, 4 Jan 2000 17:02:06 +0100 (CET)
-From: Karel Zak - Zakkr <zakkr@zf.jcu.cz>
-To: Philippe Marchesseault <P.Marchesso@Videotron.ca>
-cc: pgsql-hackers <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] replicator
-In-Reply-To: <38714B6F.2DECAEC0@Videotron.ca>
-Message-ID: <Pine.LNX.3.96.1000104162226.27234D-100000@ara.zf.jcu.cz>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: OR
-On Mon, 3 Jan 2000, Philippe Marchesseault wrote:
-> So it could become:
-> SQL --IPC--> node-replicator
-> | | |
-> via TCP send statements to each node
-> replicator (on local node)
-> |
-> via libpq send data to
-> current (local) backend.
-> > (not exist any master node, all nodes have connection to all nodes)
-> Exactly, if the replicator dies only the node dies, everything else keeps
-> working.
- Hi,
- I explored the replication concepts of Oracle and Sybase a little (in the manuals).
-(Does anyone know some interesting links or publications about it?)
- Firstly, I am sure it is premature to write replication for PgSQL now, if we
-don't have an exact concept for it. It needs more suggestions from more
-developers. We first need answers to the following questions:
- 1/ Which replication concept should we choose for PG?
- 2/ How do we manage transactions across nodes? (and we need to define a
- replication protocol for this)
- 3/ How do we integrate replication into the current PG transaction code?
-My idea (dream:-) is replication that allows full read-write on all
-nodes and that uses the current transaction method in PG - there is no
-difference between several backends on one host and several backends on several hosts
-- it gives "global transaction consistency".
-Transactions are now managed via IPC (one host); my dream is to manage
-transactions the same way, but between several hosts via TCP. (And to optimize
-this - transfer committed data/commands only.)
-Any suggestions?
-(transaction oriented replication)
- Sybase - I. model (only one node is read-write)
- primary SQL data (READ-WRITE)
- |
- replication agent (transaction log monitoring)
- |
- primary distribution server (one or more repl. servers)
- | / | \
- | nodes (READ-ONLY)
- |
- secondary dist. server
- / | \
- nodes (READ-ONLY)
- If the primary SQL node is read-write and the other nodes are *read-only*
- => the system works well even if the connection is down (data are saved to
- the replication log, and when the connection is available again the log is
- written to the node).
- Sybase - II. model (all nodes read-write)
- SQL data 1 --->--+ NODE I.
- | |
- ^ |
- | replication agent 1 (transaction log monitoring)
- V |
- | V
- | |
- replication server 1
- |
- ^
- V
- |
- replication server 2 NODE II.
- | |
- ^ +-<-->--- SQL data 2
- | |
- replication agent 2 -<--
-Sorry, I am not sure I redrew the previous picture entirely correctly..
- Karel
-From pgsql-hackers-owner+M3133@hub.org Fri Jun 9 15:02:25 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA22319
- for <pgman@candle.pha.pa.us>; Fri, 9 Jun 2000 15:02:24 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e59IsET81137;
- Fri, 9 Jun 2000 14:54:14 -0400 (EDT)
-Received: from ultra2.quiknet.com (ultra2.quiknet.com [])
- by hub.org (8.10.1/8.10.1) with SMTP id e59IrQT80458
- for <pgsql-hackers@postgresql.org>; Fri, 9 Jun 2000 14:53:26 -0400 (EDT)
-Received: (qmail 13302 invoked from network); 9 Jun 2000 18:53:21 -0000
-Received: from 18.67.tc1.oro.pmpool.quiknet.com (HELO quiknet.com) (pecondon@
- by ultra2.quiknet.com with SMTP; 9 Jun 2000 18:53:21 -0000
-Message-ID: <39413D08.A6BDC664@quiknet.com>
-Date: Fri, 09 Jun 2000 11:52:57 -0700
-From: Paul Condon <pecondon@quiknet.com>
-X-Mailer: Mozilla 4.73 [en] (X11; U; Linux 2.2.14-5.0 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: ohp@pyrenet.fr, pgsql-hackers@postgresql.org
-Subject: [HACKERS] Re: Big project, please help
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-Two way replication on a single "table" is available in Lotus Notes. In
-Notes, every record has a time-stamp, which contains the time of the
-last update. (It also has a creation timestamp.) During replication,
-timestamps are compared at the row/record level, and compared with the
-timestamp of the last replication. If, for corresponding rows in two
-replicas, the timestamp of one row is newer than the last replication,
-the contents of this newer row is copied to the other replica. But if
-both of the corresponding rows have newer timestamps, there is a
-problem. The Lotus Notes solution is to:
- 1. send a replication conflict message to the Notes Administrator,
-which message contains full copies of both rows.
- 2. copy the newest row over the less new row in the replicas.
- 3. there is a mechanism for the Administrator to reverse the default
-decision in 2, if the semantics of the message history, or off-line
-investigation indicates that the wrong decision was made.
-In practice, the Administrator is not overwhelmed with replication
-conflict messages because updates usually only originate at the site
-that originally created the row. Or updates fill only fields that were
-originally 'TBD'. The full logic is perhaps more complicated than I have
-described here, but it is already complicated enough to give you an idea
-of what you're really being asked to do. I am not aware of a supplier of
-relational database who really supports two way replication at the level
-that Notes supports it, but Notes isn't a relational database.
-The difficulty of the position that you appear to be in is that
-management might believe that the full problem is solved in brand X
-RDBMS, and you will have trouble convincing management that this is not
-really true.
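The timestamp comparison described for Lotus Notes above can be sketched as a small decision function. This is a rough illustration based only on the description in the message, which itself says the full logic is more complicated; the function name `notes_style_sync` and its return convention are hypothetical.

```python
from datetime import datetime


def notes_style_sync(ts_a, ts_b, last_replication):
    """Decide which replica's row to copy, following the rule described above.

    Returns (source, conflict): source is "a", "b", or None when neither row
    changed since the last replication; conflict is True when both changed.
    """
    a_changed = ts_a > last_replication
    b_changed = ts_b > last_replication
    if a_changed and b_changed:
        # Both rows are newer than the last replication: the newest row wins by
        # default, and the administrator is notified so the choice can be reversed.
        return ("a" if ts_a >= ts_b else "b"), True
    if a_changed:
        return "a", False
    if b_changed:
        return "b", False
    return None, False
```

Note that this relies on comparable clocks across replicas, which is exactly the weakness the earlier version-number suggestion in this thread tries to avoid.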
-From pgsql-hackers-owner+M2401@hub.org Tue May 23 12:19:54 2000
-Received: from news.tht.net (news.hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA28410
- for <pgman@candle.pha.pa.us>; Tue, 23 May 2000 12:19:53 -0400 (EDT)
-Received: from hub.org (majordom@hub.org [])
- by news.tht.net (8.9.3/8.9.3) with ESMTP id MAB53304;
- Tue, 23 May 2000 12:00:08 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M2401@hub.org)
-Received: from gwineta.repas.de (gwineta.repas.de [])
- by hub.org (8.9.3/8.9.3) with ESMTP id LAA39896
- for <pgsql-hackers@postgresql.org>; Tue, 23 May 2000 11:57:31 -0400 (EDT)
- (envelope-from kardos@repas-aeg.de)
-Received: (from smap@localhost)
- by gwineta.repas.de (8.8.8/8.8.8) id RAA27154
- for <pgsql-hackers@postgresql.org>; Tue, 23 May 2000 17:57:23 +0200
-Received: from dragon.dr.repas.de( by gwineta.repas.de via smap (V2.1)
- id xma027101; Tue, 23 May 00 17:56:20 +0200
-Received: from kardos.dr.repas.de ([])
- by dragon.dr.repas.de (UCX V4.2-21C, OpenVMS V6.2 Alpha);
- Tue, 23 May 2000 17:57:24 +0200
-Message-ID: <010201bfc4cf$7334d5a0$99301eac@Dr.repas.de>
-From: "Kardos, Dr. Andreas" <kardos@repas-aeg.de>
-To: "Todd M. Shrider" <tshrider@varesearch.com>,
- <pgsql-hackers@postgresql.org>
-References: <Pine.LNX.4.04.10005180846290.15739-100000@silicon.su.valinux.com>
-Subject: Re: [HACKERS] failing over with postgresql
-Date: Tue, 23 May 2000 17:56:20 +0200
-Organization: repas AEG Automation GmbH
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 8bit
-X-Priority: 3
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook Express 5.00.2314.1300
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-For a SCADA system (Supervisory Control and Data Acquisition) which consists
-of one master and one hot-standby server I have implemented such a
-solution. Client workstations (NT and/or UNIX) are connected to these UNIX
-servers. The database client programs run on both the client and the server side.
-When developing this approach I had two goals in mind:
-1) Not to become dependent on the PostgreSQL sources since they change very
-2) Not to become dependent on the fe/be protocol since there are discussions
-about changing it.
-So the approach is quite simple: Forward all database requests to the
-standby server on TCP/IP level.
-On both servers the postmaster listens on port 5433 and not on 5432. On
-standard port 5432 my program listens instead. This program forks twice for
-every incoming connection. The first instance forwards all packets from the
-frontend to both backends. The second instance receives the packets from all
-backends and forwards the packets from the master backend to the frontend.
-So a frontend running on a server machine connects to port 5432 of
-Another program runs on the client machine (on NT as a service). This
-program forks twice for every incoming connection. The first instance
-forwards all packets to port 5432 of the current master server and the
-second instance forwards the packets from the master server to the frontend.
-During standby computer startup the database of the master computer is
-dumped, zipped, copied to the standby computer, unzipped and loaded into
-that database.
-If a standby startup took place, all client connections are aborted to allow
-a login into the standby database. The frontends need to reconnect in this
-case. So the database of the standby computer is always in sync.
-The disadvantage of this method is that a query cannot be canceled in the
-standby server since the request key of this connection gets lost. But we
-can live with that.
-Both programs are able to run on Unix and on (native!) NT. On NT threads
-are created instead of forked processes.
-This approach is simple, but it is effective and it works.
-We hope to survive this way until real replication will be implemented in
-Andreas Kardos
------Original Message-----
-From: Todd M. Shrider <tshrider@varesearch.com>
-To: <pgsql-hackers@postgresql.org>
-Sent: Thursday, 18 May 2000 17:48
-Subject: [HACKERS] failing over with postgresql
-> is anyone working on or have working a fail-over implementation for the
-> postgresql stuff. i'd be interested in seeing if and how any might be
-> dealing with just general issues as well as the database syncing issues.
-> we are looking to do this with heartbeat and lvs in mind. also if anyone
-> is load balancing their databases that would be cool to talk about too.
-> ---
-> Todd M. Shrider VA Linux Systems
-> Systems Engineer
-> tshrider@valinux.com www.valinux.com
-From pgsql-hackers-owner+M3662@postgresql.org Tue Jan 23 16:23:34 2001
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA04456
- for <pgman@candle.pha.pa.us>; Tue, 23 Jan 2001 16:23:34 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NLKf004705;
- Tue, 23 Jan 2001 16:20:41 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M3662@postgresql.org)
-Received: from sectorbase2.sectorbase.com ([])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NLAe003753
- for <pgsql-hackers@postgresql.org>; Tue, 23 Jan 2001 16:10:40 -0500 (EST)
- (envelope-from vmikheev@SECTORBASE.COM)
-Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
- id <DG1W4Q8F>; Tue, 23 Jan 2001 12:49:07 -0800
-Message-ID: <8F4C99C66D04D4118F580090272A7A234D32AF@sectorbase1.sectorbase.com>
-From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
-To: "'dom@idealx.com'" <dom@idealx.com>, pgsql-hackers@postgresql.org
-Subject: RE: [HACKERS] Re: AW: Re: MySQL and BerkleyDB (fwd)
-Date: Tue, 23 Jan 2001 13:10:34 -0800
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2653.19)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: ORr
-> I had thought that the pre-commit information could be stored in an
-> auxiliary table by the middleware program ; we would then have
-> to re-implement some sort of higher-level WAL (I thought of the list
-> of the commands performed in the current transaction, with a sequence
-> number for each of them that would guarantee correct ordering between
-> concurrent transactions in case of a REDO). But I fear I am missing
-This wouldn't work for READ COMMITTED isolation level.
-But why do you want to log commands into WAL where each modification
-is already logged in, hm, correct order?
-Well, it makes sense if you're looking for async replication, but
-you don't need two-phase commit for this and should be aware of
-problems with the READ COMMITTED isolation level.
-Back to two-phase commit - it's the easiest part of the work required for
-distributed transaction processing.
-Currently we place a single commit record in the log, and the transaction is
-committed when this record (and so all other transaction records)
-is on disk.
-Two-phase commit:
-1. For the 1st phase we'll place a "prepared-to-commit" record into the log,
- and this phase will be accomplished after the record is flushed to disk.
- At this point the transaction may be committed at any time because
- all its modifications are logged. But it still may be rolled back
- if this phase failed on other sites of the distributed system.
-2. When all sites are prepared to commit we'll place a "committed"
- record into the log. No need to flush it, because in the event of a
- crash the recoverer will have to contact the other sites to learn the
- status of all "prepared" transactions anyway.
-That's all! It is really hard to implement distributed lock- and
-communication- managers but there is no problem with logging two
-records instead of one. Period.
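The two logging steps described above can be sketched as a toy coordinator in Python. Everything here is an assumption made for illustration: the `Site` class, its in-memory `log` list standing in for the WAL, and the `flushed` counter standing in for a forced disk write are hypothetical. The hard parts named in the message (distributed lock and communication managers, crash recovery) are not modelled.

```python
class Site:
    """Toy participant whose 'log' list stands in for a write-ahead log."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare
        self.log = []
        self.flushed = 0  # number of log records assumed to be safely on disk

    def prepare(self, xid):
        if not self.can_prepare:
            return False
        self.log.append(("prepared", xid))
        self.flushed = len(self.log)  # phase 1 ends only once the record is flushed
        return True

    def commit(self, xid):
        # Phase 2: the "committed" record is appended without a forced flush.
        self.log.append(("committed", xid))

    def abort(self, xid):
        self.log.append(("aborted", xid))


def two_phase_commit(sites, xid):
    """Return True if every site prepared and was told to commit."""
    prepared = []
    for site in sites:
        if site.prepare(xid):
            prepared.append(site)
        else:
            # One site failed phase 1: roll back the sites already prepared.
            for done in prepared:
                done.abort(xid)
            return False
    for site in sites:
        site.commit(xid)
    return True
```

After a successful run each site's log holds a flushed "prepared" record followed by an unflushed "committed" record, which mirrors the "two records instead of one" point made in the message.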
-From pgsql-hackers-owner+M3665@postgresql.org Tue Jan 23 17:05:26 2001
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA05972
- for <pgman@candle.pha.pa.us>; Tue, 23 Jan 2001 17:05:24 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NM31008120;
- Tue, 23 Jan 2001 17:03:01 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M3665@postgresql.org)
-Received: from candle.pha.pa.us (candle.navpoint.com [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0NLsU007188
- for <pgsql-hackers@postgresql.org>; Tue, 23 Jan 2001 16:54:30 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id QAA05300;
- Tue, 23 Jan 2001 16:53:53 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200101232153.QAA05300@candle.pha.pa.us>
-Subject: Re: [HACKERS] Re: AW: Re: MySQL and BerkleyDB (fwd)
-In-Reply-To: <8F4C99C66D04D4118F580090272A7A234D32AF@sectorbase1.sectorbase.com>
- "from Mikheev, Vadim at Jan 23, 2001 01:10:34 pm"
-To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
-Date: Tue, 23 Jan 2001 16:53:53 -0500 (EST)
-CC: "'dom@idealx.com'" <dom@idealx.com>, pgsql-hackers@postgresql.org
-X-Mailer: ELM [version 2.4ME+ PL77 (25)]
-MIME-Version: 1.0
-Content-Transfer-Encoding: 7bit
-Content-Type: text/plain; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-[ Charset ISO-8859-1 unsupported, converting... ]
-> > I had thought that the pre-commit information could be stored in an
-> > auxiliary table by the middleware program ; we would then have
-> > to re-implement some sort of higher-level WAL (I thought of the list
-> > of the commands performed in the current transaction, with a sequence
-> > number for each of them that would guarantee correct ordering between
-> > concurrent transactions in case of a REDO). But I fear I am missing
-> This wouldn't work for READ COMMITTED isolation level.
-> But why do you want to log commands into WAL where each modification
-> is already logged in, hm, correct order?
-> Well, it makes sense if you're looking for async replication, but
-> you don't need two-phase commit for this and should be aware of
-> problems with the READ COMMITTED isolation level.
-I believe the issue here is that while SERIALIZABLE ISOLATION means all
-queries can be run serially, our default is READ COMMITTED, meaning that
-open transactions see committed transactions, even if the transaction
-committed after our transaction started. (FYI, see my chapter on
-transactions for help, http://www.postgresql.org/docs/awbook.html.)
-To do higher-level WAL, you would have to record not only the queries,
-but the other queries that were committed at the start of each command
-in your transaction.
-Ideally, you could number every commit by its XID in your log, and then
-when processing the query, pass the "committed" transaction ids that
-were visible at the time each command began.
-In other words, you can replay the queries in transaction commit order,
-except that you have to have some transactions committed at specific
-points while other transactions are open, i.e.:
-XID    Open XIDs    Query
-500                 UPDATE t SET col = 3;
-501    500          BEGIN;
-501    500          UPDATE t SET col = 4;
-501                 UPDATE t SET col = 5;
-501                 COMMIT;
-This is a silly example, but it shows that 500 must commit after the
-first command in transaction 501, but before the second command in the
-transaction. This is because UPDATE t SET col = 5 actually sees the
-changes made by transaction 500 in READ COMMITTED isolation level.
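One way to read the table above is as input to a scheduler that decides where each commit must be placed during replay. The sketch below is a hypothetical illustration of that reading, not an existing PostgreSQL facility: the log format (tuples of XID, set of open XIDs, query text) and the function `replay_schedule` are assumptions. A transaction is committed just before the first logged command that no longer lists it as open.

```python
def replay_schedule(log):
    """Turn (xid, open_xids, query) entries into an ordered replay plan."""
    schedule = []
    pending = []  # xids whose commands have run but whose commit is not yet placed
    for xid, open_xids, query in log:
        # Any earlier transaction that this command no longer sees as open
        # must be committed before the command is replayed.
        for other in list(pending):
            if other != xid and other not in open_xids:
                schedule.append(("commit", other))
                pending.remove(other)
        schedule.append(("run", xid, query))
        if query.strip().upper().startswith("COMMIT"):
            if xid in pending:
                pending.remove(xid)
        elif xid not in pending:
            pending.append(xid)
    # Transactions never seen as closed are committed at the end.
    schedule.extend(("commit", xid) for xid in pending)
    return schedule
```

Fed the five rows of the example, the plan places the commit of transaction 500 between the two UPDATE commands of transaction 501, which is the ordering the message describes.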
-I am not advocating this. I think WAL is a better choice. I just
-wanted to outline how replaying the queries in commit order is
-> Back to two-phase commit - it's the easiest part of the work required for
-> distributed transaction processing.
-> Currently we place a single commit record in the log, and the transaction is
-> committed when this record (and so all other transaction records)
-> is on disk.
-> Two-phase commit:
-> 1. For the 1st phase we'll place a "prepared-to-commit" record into the log,
-> and this phase will be accomplished after the record is flushed to disk.
-> At this point the transaction may be committed at any time because
-> all its modifications are logged. But it still may be rolled back
-> if this phase failed on other sites of the distributed system.
-> 2. When all sites are prepared to commit we'll place a "committed"
-> record into the log. No need to flush it, because in the event of a
-> crash the recoverer will have to contact the other sites to learn the
-> status of all "prepared" transactions anyway.
-> That's all! It is really hard to implement distributed lock- and
-> communication- managers but there is no problem with logging two
-> records instead of one. Period.
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From pgsql-general-owner+M805@postgresql.org Tue Nov 21 23:53:04 2000
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA19262
- for <pgman@candle.pha.pa.us>; Wed, 22 Nov 2000 00:53:03 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eAM5qYs47249;
- Wed, 22 Nov 2000 00:52:34 -0500 (EST)
- (envelope-from pgsql-general-owner+M805@postgresql.org)
-Received: from racerx.cabrion.com (racerx.cabrion.com [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eAM5lJs46653
- for <pgsql-general@postgresql.org>; Wed, 22 Nov 2000 00:47:19 -0500 (EST)
- (envelope-from rob@cabrion.com)
-Received: from cabrionhome (gso163-25-211.triad.rr.com [])
- by racerx.cabrion.com (8.8.7/8.8.7) with SMTP id AAA13731
- for <pgsql-general@postgresql.org>; Wed, 22 Nov 2000 00:45:20 -0500
-Message-ID: <006501c05447$fb9aa0c0$4100fd0a@cabrion.org>
-From: "rob" <rob@cabrion.com>
-To: <pgsql-general@postgresql.org>
-Subject: [GENERAL] Synchronization Toolkit
-Date: Wed, 22 Nov 2000 00:49:29 -0500
-MIME-Version: 1.0
-Content-Type: multipart/mixed;
- boundary="----=_NextPart_000_0062_01C0541E.125CAF30"
-X-Priority: 3
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook Express 5.50.4133.2400
-X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400
-Precedence: bulk
-Sender: pgsql-general-owner@postgresql.org
-Status: OR
-This is a multi-part message in MIME format.
-Content-Type: text/plain; charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-Not to be confused with replication, my concept of synchronization is to
-manage changes between a server table (or tables) and one or more mobile,
-disconnected databases (i.e. PalmPilot, laptop, etc.).
-I read through the notes in the TODO for this topic and devised a tool kit
-for doing synchronization. I hope that the PostgreSQL development community
-will find this useful and will help me refine this concept by offering
-insight, experience and some good old-fashioned hacking if you are so
-The bottom of this message describes how to use the attached files.
-I look forward to your feedback.
-I devised a concept that I call "session versioning". This means that every
-time a row changes it does NOT get a new version. Rather it gets stamped
-with the current session version common to all published tables. Clients,
-when they connect for synchronization, will immediately increment this
-common version number, reserve the result as a "post version" and then
-increment the session version again. This version number, implemented as a
-sequence, is common to all synchronized tables and rows.
-Any time the server changes a row, the row gets stamped with the current
-session version. When the client posts its changes, it stamps the changed
-rows with its reserved "post version" rather than the current version. The
-reason why is explained later. It is important that the client post all its
-own changes first so that it does not end up receiving records which changed
-since its last session that it is about to update anyway.
-Reserving the post version is a two step process. First, the number is
-simply stored in a variable for later use. Second, the value is added to a
-lock table (last_stable) to indicate to any concurrent sessions that rows
-with higher version numbers are to be considered "unstable" at the moment
-and they should not attempt to retrieve them at this time. Each client,
-upon connection, will use the lowest value in this lock table (max_version)
-to determine the upper boundary for versions it should retrieve. The lower
-boundary is simply the previous session's "max_version" plus one. Thus
-when the client retrieves changes it uses the following SQL "where" clause:
-WHERE row_version >= max_version and row_version <= last_stable_version and
-row_version <> this_post_version
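The reservation steps described above can be sketched in a few lines. This is only an illustration in Python (not the toolkit's Perl), using an in-memory counter and dictionary as stand-ins for the version sequence and the last_stable lock table; the class and method names are hypothetical. The order of operations follows the start_session routine in the attached SynKit.pm.

```python
import itertools


class VersionServer:
    """Hypothetical in-memory stand-in for the version sequence and lock table."""

    def __init__(self):
        self._seq = itertools.count(1)   # plays the role of the shared sequence
        self.current = next(self._seq)   # version the server stamps on its changes
        self.last_stable = {}            # client name -> reserved post version

    def start_session(self, client):
        # Remove a stale lock left by an earlier, unfinished session.
        self.last_stable.pop(client, None)
        # Increment the sequence and reserve the result as the post version.
        post_ver = next(self._seq)
        # Upper bound for retrieval: the lowest locked version, or the version
        # that was in use before this session started if nothing is locked.
        max_ver = min(self.last_stable.values(), default=post_ver - 1)
        # Lock the post version so concurrent sessions treat it as unstable.
        self.last_stable[client] = post_ver
        # Increment again so concurrent server changes get a later version.
        self.current = next(self._seq)
        return post_ver, max_ver

    def end_session(self, client):
        # Release the lock on this client's post version.
        del self.last_stable[client]
```

A client would then retrieve rows whose version falls between its previous upper bound and the returned max_ver, skipping rows stamped with its own post version.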
-The point of reserving and locking a post version is important in that it
-allows concurrent synchronization by multiple clients. The first, of many,
-clients to connect basically dictates to all future clients that they must
-not take any rows equal to or greater than the one which it just reserved
-and locked. The reason the session version is incremented a second time is
-so that the server may continue to post changes concurrent with any client
-changes and be certain that these concurrent server changes will not taint
-rows the client is about to retrieve. Once the client is finished with its
-session it removes the lock on its post version.
-Partitioning data for use by each node is the next challenge we face. How
-can we control which "slice" of data each client receives? A slice can be
-horizontal or vertical within a table. Horizontal slices are easy: each is
-just the where clause of an SQL statement that says "give me the rows that
-match X criteria". We handle this by storing and appending a where clause
-to each client's retrieval statement in addition to the where clause described
-above. Actually, two where clauses are stored and appended. One is per
-client and one is per publication (table).
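As a rough sketch of how the stored clauses could be combined with the version filter (Python for illustration only; build_where and its parameter names are assumptions, not part of the toolkit):

```python
def build_where(version_filter, client_clause=None, publication_clause=None):
    """Join the version filter with the optional per-client and
    per-publication clauses; empty or missing clauses are skipped."""
    parts = [version_filter]
    for clause in (client_clause, publication_clause):
        if clause:
            # Parenthesize stored clauses so an OR inside one of them
            # cannot change the meaning of the combined condition.
            parts.append(f"({clause})")
    return "WHERE " + " AND ".join(parts)
```

Skipping empty clauses avoids emitting a dangling "AND" when only one of the two stored clauses is set.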
-We defined horizontal slices by filtering rows. Vertical slices are limited
-by column. The tool kit does provide a mechanism for pseudo vertical
-partitioning. When a client is "subscribed" to a publication, the toolkit
-stores what columns that node is to receive during a session. These are
-stored in the subscribed_cols table. While this does limit the number of
-columns transmitted, the insert/update/delete triggers do not recognize
-changes based on columns. The "pseudo" nature of our vertical partitioning
-is evident by example:
-Say you have a table with name, address and phone number as columns. You
-restrict a client to see only name and address. This means that phone
-number information will not be sent to the client during synchronization,
-and the client can't attempt to alter the phone number of a given entry.
-Great, but . . . if, on the server, the phone number (but not the name or
-address) is changed, the entire row gets marked with a new version. This
-means that the name and address will get sent to the client even though they
-didn't change.
-Well, there's the flaw in vertical partitioning. Other than wasting
-bandwidth, the extra row does no harm to the process. The workaround for
-this is to highly normalize your schema when possible.
-Collisions are the next crux one encounters with synchronization. When two
-clients retrieve the same row and both make (different) changes, which one is
-correct? So far the system operates totally independent of time. This is
-good because it doesn't rely on the server or client to keep accurate time.
-We can just ignore time altogether, but then we force our clients to
-synchronize on a strict schedule in order to avoid (or reduce) collisions.
-If every node synchronized immediately after making changes we could just
-stop here. Unfortunately this isn't reality. Consider two clients, A and B,
-that each pick up the same record on Monday. A will
-make changes on Monday, then leave for vacation. B will make changes on
-Wednesday because new information was gathered in A's absence. Client B
-posts those changes Wednesday. Meanwhile, client A returns from vacation on
-Friday and synchronizes his changes. A overwrites B's changes even though
-A made changes before the most recent information was posted by B.
-It is clear that we need some form of time stamp to cope with the above
-example. While clocks aren't the most reliable, they are the only common
-version control available to solve this problem. The system is set up to
-accept (but not require) timestamps from clients and changes on the server
-are time stamped. The system, when presented a time stamp with a row, will
-compare it with the server's time stamp to figure out who wins in a tie. The
-system makes certain "sanity" checks with regard to these time stamps. A client
-may not attempt to post a change with a timestamp that is more than one hour in
-the future (according to what the server thinks "now" is) nor more than one hour
-before its last synchronization date/time. The client row will be immediately placed into
-the collision table if the timestamp is that far out of whack.
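The two sanity checks can be illustrated as follows. This is a sketch in Python rather than the toolkit's Perl; the function name is hypothetical, and all times are assumed to be seconds since the epoch. The 'future' and 'past' labels mirror the reasons that apply_change passes to collide in the attached SynKit.pm.

```python
ONE_HOUR = 3600  # tolerance, in seconds, allowed on either side


def timestamp_problem(client_ts, last_session_ts, now):
    """Return 'future' or 'past' when a client timestamp fails the sanity
    checks, or None when it is acceptable (or when no timestamp was sent)."""
    if client_ts is None:
        return None  # timestamps are optional
    if client_ts > now + ONE_HOUR:
        return "future"  # more than one hour ahead of the server's "now"
    if client_ts < last_session_ts - ONE_HOUR:
        return "past"  # more than one hour before the last synchronization
    return None
```

A row for which this returns a reason would go to the collision table instead of being applied.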
-Implementations of the tool kit should take care to ensure that client &
-server agree on what "now" is before attempting to submit changes with
-timestamps.
-Time stamps are not required. Should a client be incapable of tracking
-timestamps, the system will assume that any server row which has been
-changed since the client's last session will win a tie. This is quite error
-prone, so timestamps are encouraged where possible.
-Inserts pose an interesting challenge. Since multiple clients cannot share
-a sequence (often used as a primary key) while disconnected, they will be
-responsible for their own unique "row_id" when inserting records. Inserts
-accept any arbitrary key, and write back to the client a special kind of
-update that gives the server's row_id. The client is responsible for making
-sure that this update takes place locally.
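The round trip for an insert might look like this in outline. It is a Python sketch with hypothetical names; the real toolkit writes a "changeid" message back to the client, whose exact format is not reproduced here.

```python
import itertools


class RowIdAllocator:
    """Stand-in for the server-side row_id sequence (hypothetical)."""

    def __init__(self, start=1):
        self._seq = itertools.count(start)

    def insert(self, client_row_id):
        # The server accepts any client key, assigns its own row_id and
        # reports the pair back so the client can renumber its local row.
        server_row_id = next(self._seq)
        return ("changeid", client_row_id, server_row_id)


def apply_changeid(local_rows, message):
    """Client side: re-key the local row under the server's row_id."""
    _, old_id, new_id = message
    local_rows[new_id] = local_rows.pop(old_id)
```

Until the client applies this update, its local key and the server's row_id refer to the same record under different names, so later updates from that client would miss the server row.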
-Deletes are the last portion of the process. When deletes occur, the
-row_id, version, etc. are stored in a "deleted" table. These entries are
-retrieved by the client using the same version filter as described above.
-The table is pruned at the end of each session by deleting all records with
-versions that are less than the lowest 'last_version' stored for each
-Having wrapped up the synchronization process, I'll move on to describe some
-points about managing clients, publications and the like.
-The tool kit is split into two objects: SyncManagement and Synchronization.
-The Synchronization object exposes an API that client implementations use to
-communicate and receive changes. The management functions handle system
-install and uninstall in addition to publication of tables and client
-Installation and uninstallation are handled by their corresponding functions
-in the API. All system tables are prefixed and suffixed with four
-underscores, in hopes that this avoids conflict with any existing tables.
-Calling the install function more than once will generate an error message.
-Uninstall will remove all related tables, sequences, functions and triggers
-from the system.
-The first step, after installing the system, is to publish a table. A table
-can be published more than once under different names. Simply provide a
-unique name as the second argument to the publish function. Since object
-names are restricted to 32 characters in Postgres, each table is given a
-unique id and this id is used to create the trigger and sequence names.
-Since one table can be published multiple times, but only needs one set of
-triggers and one sequence for change management, a reference count is kept so
-that we know when to add/drop triggers and functions. By default, all
-columns are published, but the third argument to the publish function
-accepts an array reference of column names that allows you to specify a
-limited set. Information about the table is stored in the "tables" table,
-info about the publication is in the "publications" table and column names
-are stored in "subscribed_cols" table.
-The next step is to subscribe a client to a table. A client is identified
-by a user name and a node name. The subscribe function takes three
-arguments: user, node & publication. The subscription process writes an
-entry into the "subscribed" table with default values. Of note, the
-"RefreshOnce" attribute is set to true whenever a table is published. This
-indicates to the system that a full table refresh should be sent the next
-time the client connects even if the client requests synchronization rather
-than refresh.
-The toolkit does not, yet, provide a way to manage the whereclause stored at
-either the publication or client level. To use or test this feature, you
-will need to set the whereclause attributes manually.
-Tables and users can be unpublished and unsubscribed using the corresponding
-functions within the tool kit's management interface. Because postgres
-lacks an "ALTER TABLE DROP COLUMN" function, the unpublish function only
-removes default values and indexes for those columns.
-The API isn't the most robust thing in the world right now. All functions
-return undef on success and an error string otherwise (like DBD). I hope to
-clean up the API considerably over the next month. The code has not been
-field tested at this time.
-The files attached are:
-1) SynKit.pm (A perl module that contains install/uninstall functions and a
-simple api for synchronization & management)
-2) sync_install.pl (Sample code to demonstrate the installation, publishing
-and subscribe process)
-3) sync_uninstall.pl (Sample code to demonstrate the uninstallation,
-unpublishing and unsubscribe process)
-To use them on Linux (don't know about Win32 but should work fine):
- - set up a test database and make SURE plpgsql is installed
- - install perl 5.05 along with Date::Parse(TimeDate-1.1) , DBI and DBD::Pg
-modules [www.cpan.org]
- - copy all three attached files to a test directory
- - cd to your test directory
- - edit all three files and change the three DBI variables to suit your
-system (they are clearly marked)
- - % perl sync_install.pl
- - check out the tables, functions & triggers installed
- - % perl sync.pl
- - check out the 'sync_test' table, do some updates/inserts/deletes and run
-sync.pl again
- NOTE: Sanity checks default to allow no more than 50% of the table
-to be changed by the client in a single session.
- If you delete all (or most of) the rows you will get errors when
-you run sync.pl again! (by design)
- - % perl sync_uninstall.pl (when you are done)
- - check out the sample scripts and the perl module code (commented, but
-not documented)
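The sanity limit mentioned in the note above is stored as a percentage and converted to a row count at the start of each session. A minimal sketch of that arithmetic, in Python with hypothetical function names:

```python
def row_limit(row_count, percent):
    """Convert a percentage limit to a number of rows.
    A percentage of zero or less (or None) means no limit."""
    if percent is None or percent <= 0:
        return None
    return row_count * percent / 100


def exceeds(change_count, limit):
    """True when the running change count has passed the limit.
    A missing or zero limit is treated as 'no limit'."""
    return bool(limit) and change_count > limit
```

With the six-row sync_test table and the default of 50%, the limit works out to three rows, so the fourth change in a session is the one that trips the check.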
-Content-Type: application/octet-stream; name="sync.pl"
-Content-Transfer-Encoding: quoted-printable
-Content-Disposition: attachment; filename="sync.pl"
-# This script depicts the synchronization process for two users.
-my $dbi_connect_string = 'dbi:Pg:dbname=test;host=snoopy'; #
-my $db_user = 'test'; #
-my $db_pass = 'test'; #
-my $ret; #holds return value
-use SynKit;
-#create a synchronization object (pass dbi connection info)
-my $s = Synchronize->new($dbi_connect_string,$db_user,$db_pass);
-#start a session by passing a user name, "node" identifier and a collision queue name (client or server)
-$ret = $s->start_session('JOE','REMOTE_NODE_NAME','server');
-print "Handle this error: $ret\n\n" if $ret;
-#call this once before attempting to apply individual changes
-$ret = $s->start_changes('sync_test',['name']);
-print "Handle this error: $ret\n\n" if $ret;
-#call this for each change the client wants to make to the database
-$ret = $s->apply_change(CLIENTROWID,'insert',undef,['ted']);
-print "Handle this error: $ret\n\n" if $ret;
-#call this for each change the client wants to make to the database
-$ret =3D $s->apply_change(CLIENTROWID,'insert','1973-11-10 11:25:00 AM -05=
-print "Handle this error: $ret\n\n" if $ret;
-#call this for each change the client wants to make to the database
-$ret = $s->apply_change(999,'update',undef,['tom']);
-print "Handle this error: $ret\n\n" if $ret;
-#call this for each change the client wants to make to the database
-$ret = $s->apply_change(1,'update',undef,['tom']);
-print "Handle this error: $ret\n\n" if $ret;
-#call this once after all changes have been submitted
-$ret = $s->end_changes();
-print "Handle this error: $ret\n\n" if $ret;
-#call this to get updates from all subscribed tables
-$ret = $s->get_all_updates();
-print "Handle this error: $ret\n\n" if $ret;
-print "\n\nSynchronization session is complete. (JOE) \n\n";
-# make some changes to the database (server perspective)
-print "\n\nMaking changes to the database. (server side) \n\n";
-use DBI;
-my $dbh = DBI->connect($dbi_connect_string,$db_user,$db_pass);
-$dbh->do("insert into sync_test values ('roger')");
-$dbh->do("insert into sync_test values ('john')");
-$dbh->do("insert into sync_test values ('harry')");
-$dbh->do("delete from sync_test where name = 'roger'");
-$dbh->do("update sync_test set name = 'tom' where name = 'harry'");
-#now do another session for a different user
-#start a session by passing a user name, "node" identifier and a collision queue name (client or server)
-$ret = $s->start_session('KEN','ANOTHER_REMOTE_NODE_NAME','server');
-print "Handle this error: $ret\n\n" if $ret;
-#call this to get updates from all subscribed tables
-$ret = $s->get_all_updates();
-print "Handle this error: $ret\n\n" if $ret;
-print "\n\nSynchronization session is complete. (KEN)\n\n";
-print "Now look at your database and see what happened, make changes to the test table, etc. and run this again.\n\n";
-Content-Type: application/octet-stream; name="sync_uninstall.pl"
-Content-Transfer-Encoding: quoted-printable
-Content-Disposition: attachment; filename="sync_uninstall.pl"
-# this script uninstalls the synchronization system using the SyncManager object
-use SynKit;
-### CHANGE THESE TO MATCH YOUR SYSTEM ########################
-my $dbi_connect_string = 'dbi:Pg:dbname=test;host=snoopy'; #
-my $db_user = 'test'; #
-my $db_pass = 'test'; #
-my $ret; #holds return value
-#create an instance of the SyncManager object
-my $m = SyncManager->new($dbi_connect_string,$db_user,$db_pass);
-# call this to unsubscribe a user/node (not necessary if you are uninstalling)
-print $m->unsubscribe('KEN','ANOTHER_REMOTE_NODE_NAME','sync_test');
-#call this to unpublish a table (not necessary if you are uninstalling)
-print $m->unpublish('sync_test');
-#call this to uninstall the synchronization system
-# NOTE: this will automatically unpublish & unsubscribe all users
-print $m->UNINSTALL;
-# now let's drop our little test table
-use DBI;
-my $dbh = DBI->connect($dbi_connect_string,$db_user,$db_pass);
-$dbh->do("drop table sync_test");
-print "\n\nI hope you enjoyed this little demonstration\n\n";
-Content-Type: application/octet-stream; name="sync_install.pl"
-Content-Transfer-Encoding: quoted-printable
-Content-Disposition: attachment; filename="sync_install.pl"
-# This script shows how to install the synchronization system
-# using the SyncManager object
-use SynKit;
-### CHANGE THESE TO MATCH YOUR SYSTEM ##########################
-my $dbi_connect_string = 'dbi:Pg:dbname=test;host=snoopy'; #
-my $db_user = 'test'; #
-my $db_pass = 'test'; #
-my $ret; #holds return value
-#create an instance of the sync manager object
-my $m = SyncManager->new($dbi_connect_string,$db_user,$db_pass);
-#Call this to install the synchronization management tables, etc.
-$ret = $m->INSTALL;
-die "Handle this error: $ret\n\n" if $ret;
-#create a test table for us to demonstrate with
-use DBI;
-my $dbh = DBI->connect($dbi_connect_string,$db_user,$db_pass);
-$dbh->do("create table sync_test (name text)");
-$dbh->do("insert into sync_test values ('rob')");
-$dbh->do("insert into sync_test values ('rob')");
-$dbh->do("insert into sync_test values ('rob')");
-$dbh->do("insert into sync_test values ('ted')");
-$dbh->do("insert into sync_test values ('ted')");
-$dbh->do("insert into sync_test values ('ted')");
-#call this to "publish" a table
-$ret = $m->publish('sync_test');
-print "Handle this error: $ret\n\n" if $ret;
-#call this to "subscribe" a user/node to a publication (table)
-$ret = $m->subscribe('JOE','REMOTE_NODE_NAME','sync_test');
-print "Handle this error: $ret\n\n" if $ret;
-#call this to "subscribe" a user/node to a publication (table)
-$ret = $m->subscribe('KEN','ANOTHER_REMOTE_NODE_NAME','sync_test');
-print "Handle this error: $ret\n\n" if $ret;
-print "Now you can do: 'perl sync.pl' a few times to play\n\n";
-print "Do 'perl sync_uninstall.pl' to uninstall the system\n";
-Content-Type: application/octet-stream; name="SynKit.pm"
-Content-Transfer-Encoding: quoted-printable
-Content-Disposition: attachment; filename="SynKit.pm"
-# Perl DB synchronization toolkit
-#created for postgres 7.0.2 +
-use strict;
- use vars qw($VERSION);
- # set the version for version checking
- $VERSION = 1.00;
-package Synchronize;
-use DBI;
-use Date::Parse;
-# new requires 3 arguments: dbi connection string, plus the corresponding username and password to get connected to the database
-sub new {
-    my $proto = shift;
-    my $class = ref($proto) || $proto;
-    my $self = {};
-    my $dbi = shift;
-    my $user = shift;
-    my $pass = shift;
-    $self->{DBH} = DBI->connect($dbi,$user,$pass) || die "Failed to connect to database: ".DBI->errstr();
-    $self->{user} = undef;
-    $self->{node} = undef;
-    $self->{status} = undef; # holds status of table update portion of session
-    $self->{pubs} = {}; #holds hash of pubs available to session with val = 1 if ok to request sync
-    $self->{orderpubs} = undef; #holds array ref of subscribed pubs ordered by sync_order
-    $self->{this_post_ver} = undef; #holds the version number under which this session will post changes
-    $self->{max_ver} = undef; #holds the maximum safe version for getting updates
-    $self->{current} = {}; #holds the current publication info to which changes are being applied
-    $self->{queue} = 'server'; # tells collide function what to do with collisions. (default is to hold on server)
-    # second connection, so log entries survive a rollback on the main handle
-    $self->{DBLOG} = DBI->connect($dbi,$user,$pass) || die "cannot log to DB: ".DBI->errstr();
-    return bless ($self, $class);
-}
-sub dblog {
-    my $self = shift;
-    my $msg = $self->{DBLOG}->quote($_[0]);
-    my $quser = $self->{DBH}->quote($self->{user});
-    my $qnode = $self->{DBH}->quote($self->{node});
-    $self->{DBLOG}->do("insert into ____sync_log____ (username, nodename,stamp, message) values($quser, $qnode, now(), $msg)");
-}
-#start_session establishes session wide information and other housekeeping
-    # Accepts username, nodename and queue (client or server) as arguments;
-sub start_session {
-    my $self = shift;
-    $self->{user} = shift || die 'Username is required';
-    $self->{node} = shift || die 'Nodename is required';
-    $self->{queue} = shift;
-    if ($self->{queue} ne 'server' && $self->{queue} ne 'client') {
-        die "You must provide a queue argument of either 'server' or 'client'";
-    }
-    my $quser = $self->{DBH}->quote($self->{user});
-    my $qnode = $self->{DBH}->quote($self->{node});
-    my $sql = "select pubname from ____subscribed____ where username = $quser and nodename = $qnode";
-    my @pubs = $self->GetColList($sql);
-    return 'User/Node has no subscriptions!' if !@pubs;
-    # go though the list and check permissions and rules for each
-    foreach my $pub (@pubs) {
-        my $qpub = $self->{DBH}->quote($pub);
-        my $sql = "select disabled, pubname, fullrefreshonly, refreshonce,post_ver from ____subscribed____ where username = $quser and pubname = $qpub and nodename = $qnode";
-        my $sth = $self->{DBH}->prepare($sql) || die $self->{DBH}->errstr;
-        $sth->execute || die $self->{DBH}->errstr;
-        my @row;
-        while (@row = $sth->fetchrow_array) {
-            next if $row[0]; #publication is disabled
-            next if !defined($row[1]); #publication does not exist (should never occur)
-            if ($row[2] || $row[3]) { #refresh or refresh once flag is set
-                $self->{pubs}->{$pub} = 0; #refresh only
-                next;
-            }
-            if (!defined($row[4])) { #no previous session exists, must refresh
-                $self->{pubs}->{$pub} = 0; #refresh only
-                next;
-            }
-            $self->{pubs}->{$pub} = 1; #OK for sync
-        }
-        $sth->finish;
-    }
-    $sql = "select pubname from ____publications____ order by sync_order";
-    my @op = $self->GetColList($sql);
-    my @orderpubs;
-    #loop through ordered pubs and remove non subscribed publications
-    foreach my $pub (@op) {
-        push @orderpubs, $pub if defined($self->{pubs}->{$pub});
-    }
-    $self->{orderpubs} = \@orderpubs;
-# Now we obtain a session version number, etc.
-    $self->{DBH}->{AutoCommit} = 0; #allows "transactions"
-    $self->{DBH}->{RaiseError} = 1; #script [or eval] will automatically die on errors
-    eval { #start DB transaction
-        #lock the version sequence until we determine that we have gotten
-        #a good value. Lock will be released on commit.
-        $self->{DBH}->do('lock ____version_seq____ in access exclusive mode');
-        # remove stale locks if they exist
-        $sql = "delete from ____last_stable____ where username = $quser and nodename = $qnode";
-        $self->{DBH}->do($sql);
-        # increment version sequence & grab the next val as post_ver
-        $sql = "select nextval('____version_seq____')";
-        my $sth = $self->{DBH}->prepare($sql);
-        $sth->execute;
-        ($self->{this_post_ver}) = $sth->fetchrow_array();
-        $sth->finish;
-        # grab max_ver from last_stable
-        $sql = "select min(version) from ____last_stable____";
-        $sth = $self->{DBH}->prepare($sql);
-        $sth->execute;
-        ($self->{max_ver}) = $sth->fetchrow_array();
-        $sth->finish;
-        # if there was no version in lock table, then take the ID that was in use
-        # when we started the session ($this_post_ver - 1)
-        $self->{max_ver} = $self->{this_post_ver} - 1 if (!defined($self->{max_ver}));
-        # lock post_ver by placing it in last_stable
-        $self->{DBH}->do("insert into ____last_stable____ (version, username, nodename) values ($self->{this_post_ver}, $quser,$qnode)");
-        # increment version sequence again (discard result)
-        $sql = "select nextval('____version_seq____')";
-        $sth = $self->{DBH}->prepare($sql);
-        $sth->execute;
-        $sth->fetchrow_array();
-        $sth->finish;
-    }; #end eval/transaction
-    $self->{DBH}->{RaiseError} = 0;
-    if ($@) { # part of transaction failed
-        # roll back and restore the handle before reporting the failure
-        $self->{DBH}->rollback;
-        $self->{DBH}->{AutoCommit} = 1;
-        return 'Start session failed';
-    }
-    $self->{DBH}->commit; # all's well, commit block
-    $self->{DBH}->{AutoCommit} = 1;
-    return undef;
-}
-#start changes should be called once before applying individual change requests
-    # Requires publication and ref to columns that will be updated as arguments
-sub start_changes {
-    my $self = shift;
-    my $pub = shift || die 'Publication is required';
-    my $colref = shift || die 'Reference to column array is required';
-    $self->{status} = 'starting';
-    my $qpub = $self->{DBH}->quote($pub);
-    my $quser = $self->{DBH}->quote($self->{user});
-    my $qnode = $self->{DBH}->quote($self->{node});
-    my @cols = @{$colref};
-    my @subcols = $self->GetColList("select col_name from ____subscribed_cols____ where username = $quser and nodename = $qnode and pubname = $qpub");
-    my %subcols;
-    foreach my $col (@subcols) {
-        $subcols{$col} = 1;
-    }
-    foreach my $col (@cols) {
-        return "User/node is not subscribed to column '$col'" if !$subcols{$col};
-    }
-    my $sql = "select pubname, readonly, last_session, post_ver, last_ver, whereclause, sanity_limit, sanity_delete, sanity_update, sanity_insert from ____subscribed____ where username = $quser and pubname = $qpub and nodename = $qnode";
-    my ($junk, $readonly, $last_session, $post_ver, $last_ver, $whereclause, $sanity_limit, $sanity_delete, $sanity_update, $sanity_insert) = $self->GetOneRow($sql);
-    return 'Publication is read only' if $readonly;
-    $sql = "select whereclause from ____publications____ where pubname = $qpub";
-    my ($wc) = $self->GetOneRow($sql);
-    # combine the per-client and per-publication clauses, skipping empty ones
-    $whereclause = join(' and ', map { "($_)" } grep { defined($_) && length($_) } ($whereclause, $wc));
-    my ($table) = $self->GetOneRow("select tablename from ____publications____ where pubname = $qpub");
-    return 'Publication is not registered correctly' if !defined($table);
-    my %info;
-    $info{pub} = $pub;
-    $info{whereclause} = $whereclause;
-    $info{post_ver} = $post_ver;
-    $last_session =~ s/([+-]\d\d?)$/ $1/; #put a space before timezone
-    $last_session = str2time ($last_session); #convert to perltime (seconds since 1970)
-    $info{last_session} = $last_session;
-    $info{last_ver} = $last_ver;
-    $info{table} = $table;
-    $info{cols} = \@cols;
-    $sql = "select count(oid) from $table";
-    $sql = $sql .' where '.$whereclause if $whereclause;
-    my ($rowcount) = $self->GetOneRow($sql);
-    #calculate sanity levels (convert from % to number of rows)
-    # limits defined as less than 1 mean no limit
-    $info{sanitylimit} = $rowcount * ($sanity_limit / 100) if $sanity_limit > 0;
-    $info{insertlimit} = $rowcount * ($sanity_insert / 100) if $sanity_insert > 0;
-    $info{updatelimit} = $rowcount * ($sanity_update / 100) if $sanity_update > 0;
-    $info{deletelimit} = $rowcount * ($sanity_delete / 100) if $sanity_delete > 0;
-    $self->{sanitycount} = 0;
-    $self->{updatecount} = 0;
-    $self->{insertcount} = 0;
-    $self->{deletecount} = 0;
-    $self->{current} = \%info;
-    $self->{DBH}->{AutoCommit} = 0; #turn on transaction behavior so we can roll back on sanity limits, etc.
-    $self->{status} = 'ready';
-    return undef;
-}
-#call this once all changes are submitted to commit them;
-sub end_changes {
-    my $self = shift;
-    return undef if $self->{status} ne 'ready';
-    $self->{DBH}->commit;
-    $self->{DBH}->{AutoCommit} = 1;
-    $self->{status} = 'success';
-    return undef;
-}
-#call apply_change once for each row level client update
- # Accepts 4 params: rowid, action, timestamp and reference to data array
- # Note: timestamp can be undef, data can be undef
- # timestamp MUST be in perl time (secs since 1970)
-#this routine checks basic timestamp info and sanity limits, then passes th=
-e info along to do_action() for processing
-sub apply_change {
- my $self =3D shift;
- my $rowid =3D shift || return 'Row ID is required'; #don't die just for on=
-e bad row
- my $action =3D shift || return 'Action is required'; #don't die just for o=
-ne bad row
- my $timestamp =3D shift;
- my $dataref =3D shift;
- $action =3D lc($action);
- $timestamp =3D str2time($timestamp) if $timestamp;
- return 'Status failure, cannot accept changes: '.$self->{status} if $self-=
->{status} ne 'ready';
- my %info =3D %{$self->{current}};
- $self->{sanitycount}++;
- if ($info{sanitylimit} && $self->{sanitycount} > $info{sanitylimit}) {
- # too many changes from client
- my $ret =3D $self->sanity('limit');
- return $ret if $ret;
- }
- if ($timestamp && $timestamp > time() + 3600) { # current time + one hour
- #client's clock is way off, cannot submit changes in future
- my $ret =3D $self->collide('future', $info{table}, $rowid, $action, undef=
-, $timestamp, $dataref, $self->{queue});
- return $ret if $ret;
- }
- if ($timestamp && $timestamp < $info{last_session} - 3600) { # last sessio=
-n time less one hour
- #client's clock is way off, cannot submit changes that occured before las=
-t sync date
- my $ret =3D $self->collide('past', $info{table}, $rowid, $action, undef, =
-$timestamp, $dataref , $self->{queue});
- return $ret if $ret;
- }
- my ($crow, $cver, $ctime); #current row,ver,time
- if ($action ne 'insert') {
- my $sql =3D "select ____rowid____, ____rowver____, ____stamp____ from $in=
-fo{table} where ____rowid____ =3D $rowid";
- ($crow, $cver, $ctime) =3D $self->GetOneRow($sql);
- if (!defined($crow)) {
- my $ret =3D $self->collide('norow', $info{table}, $rowid, $action, undef=
-, $timestamp, $dataref , $self->{queue});
- return $ret if $ret;=09=09
- }
- $ctime =3D~ s/([+|-]\d\d?)$/ $1/; #put space between timezone
- $ctime =3D str2time($ctime) if $ctime; #convert to perl time
- if ($timestamp) {
- if ($ctime < $timestamp) {
- my $ret =3D $self->collide('time', $info{table}, $rowid, $action, undef=
-, $timestamp, $dataref, $self->{queue} );=09=09
- return $ret if $ret;
- }
- } else {
- if ($cver > $self->{this_post_ver}) {
- my $ret =3D $self->collide('version', $info{table}, $rowid, $action, un=
-def, $timestamp, $dataref, $self->{queue} );
- return $ret if $ret;
- }
- }
- }
- if ($action eq 'insert') {
- $self->{insertcount}++;
- if ($info{insertlimit} && $self->{insertcount} > $info{insertlimit}) {
- # too many changes from client
- my $ret =3D $self->sanity('insert');
- return $ret if $ret;
- }
- my $qtable =3D $self->{DBH}->quote($info{table});
- my ($rowidsequence) =3D '_'.$self->GetOneRow("select table_id from ____ta=
-bles____ where tablename =3D $qtable").'__rowid_seq';
- return 'Table incorrectly registered, cannot get rowid sequence name: '.$=
-self->{DBH}->errstr() if not defined $rowidsequence;
- my @data;
- foreach my $val (@{$dataref}) {
- push @data, $self->{DBH}->quote($val);
- }
- my $sql =3D "insert into $info{table} (";
- if ($timestamp) {
- $sql =3D $sql . join(',',@{$info{cols}}) . ',____rowver____, ____stamp__=
-__) values (';
- $sql =3D $sql . join (',',@data) .','.$self->{this_post_ver}.',\''.local=
- } else {
- $sql =3D $sql . join(',',@{$info{cols}}) . ',____rowver____) values (';
- $sql =3D $sql . join (',',@data) .','.$self->{this_post_ver}.')';
- }
- my $ret =3D $self->{DBH}->do($sql);
- if (!$ret) {
- my $ret =3D $self->collide($self->{DBH}->errstr(), $info{table}, $rowid,=
- $action, undef, $timestamp, $dataref , $self->{queue});
- return $ret if $ret;=09=09
- }
- my ($newrowid) =3D $self->GetOneRow("select currval('$rowidsequence')");
- return 'Failed to get current rowid on inserted row'.$self->{DBH}->errstr=
- if not defined $newrowid;
- $self->changerowid($rowid, $newrowid);
- }
- if ($action eq 'update') {
- $self->{updatecount}++;
- if ($info{updatelimit} && $self->{updatecount} > $info{updatelimit}) {
- # too many changes from client
- my $ret =3D $self->sanity('update');
- return $ret if $ret;
- }
- my @data;
- foreach my $val (@{$dataref}) {
- push @data, $self->{DBH}->quote($val);
- }=09
- my $sql =3D "update $info{table} set ";
- my @cols =3D @{$info{cols}};
- foreach my $col (@cols) {
- my $val =3D shift @data;
- $sql =3D $sql . "$col =3D $val,";
- }
- $sql =3D $sql." ____rowver____ =3D $self->{this_post_ver}";
- $sql =3D $sql.", ____stamp____ =3D '".localtime($timestamp)."'" if $times=
- $sql =3D $sql." where ____rowid____ =3D $rowid";
- $sql =3D $sql." and $info{whereclause}" if $info{whereclause};
- my $ret =3D $self->{DBH}->do($sql);
- if (!$ret) {
- my $ret =3D $self->collide($self->{DBH}->errstr(), $info{table}, $rowid,=
- $action, undef, $timestamp, $dataref , $self->{queue});
- return $ret if $ret;=09=09
- }
- }
- if ($action eq 'delete') {
- $self->{deletecount}++;
- if ($info{deletelimit} && $self->{deletecount} > $info{deletelimit}) {
- # too many changes from client
- my $ret =3D $self->sanity('delete');
- return $ret if $ret;
- }
- if ($timestamp) {
- my $sql =3D "update $info{table} set ____rowver____ =3D $self->{this_pos=
-t_ver}, ____stamp____ =3D '".localtime($timestamp)."' where ____rowid____ =
-=3D $rowid";
- $sql =3D $sql . " and $info{whereclause}" if $info{whereclause};
- $self->{DBH}->do($sql) || return 'Predelete update failed: '.$self->{DBH=
- } else {
- my $sql =3D "update $info{table} set ____rowver____ =3D $self->{this_pos=
-t_ver} where ____rowid____ =3D $rowid";
- $sql =3D $sql . " and $info{whereclause}" if $info{whereclause};
- $self->{DBH}->do($sql) || return 'Predelete update failed: '.$self->{DBH=
- }
- my $sql =3D "delete from $info{table} where ____rowid____ =3D $rowid";
- $sql =3D $sql . " and $info{whereclause}" if $info{whereclause};
- my $ret =3D $self->{DBH}->do($sql);
- if (!$ret) {
- my $ret =3D $self->collide($self->{DBH}->errstr(), $info{table}, $rowid,=
- $action, undef, $timestamp, $dataref , $self->{queue});
- return $ret if $ret;=09=09
- }
- return undef;
-sub changerowid {
- my $self =3D shift;
- my $oldid =3D shift;
- my $newid =3D shift;
- $self->writeclient('changeid',"$oldid\t$newid");
-#writes info to client
-sub writeclient {
- my $self =3D shift;
- my $type =3D shift;
- my @info =3D @_;
- print "$type: ",join("\t",@info),"\n";
- return undef;
-# Override this for custom behavior. Default is to echo back the sanity fa=
-ilure reason.=20=20
-# If you want to override a collision, you can do so by returning undef.
-sub sanity {
- my $self =3D shift;
- my $reason =3D shift;
- $self->{status} =3D 'sanity exceeded';
- $self->{DBH}->rollback;
- return $reason;
-# Override this for custom behavior. Default is to echo back the failure r=
-# If you want to override a collision, you can do so by returning undef.
-sub collide {
- my $self =3D shift;
- my ($reason,$table,$rowid,$action,$rowver,$timestamp,$data, $queue) =3D @_;
- my @data;
- foreach my $val (@{$data}) {
- push @data, $self->{DBH}->quote($val);
- }=09
- if ($reason =3D~ /integrity/i || $reason =3D~ /constraint/i) {
- $self->{status} =3D 'integrity violation';
- $self->{DBH}->rollback;
- }
- my $datastring;
- my @cols =3D @{$self->{current}->{cols}};
- foreach my $col (@cols) {
- my $val =3D shift @data;
- $datastring =3D $datastring . "$col =3D $val,";
- }
- chop $datastring; #remove trailing comma
- if ($queue eq 'server') {
- $timestamp =3D localtime($timestamp) if defined($timestamp);
- $rowid =3D $self->{DBH}->quote($rowid);
- $rowid =3D 'null' if !defined($rowid);
- $rowver =3D 'null' if !defined($rowver);
- $timestamp =3D $self->{DBH}->quote($timestamp);
- $data =3D $self->{DBH}->quote($data);
- my $qtable =3D $self->{DBH}->quote($table);
- my $qreason =3D $self->{DBH}->quote($reason);
- my $qaction =3D $self->{DBH}->quote($action);
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- $datastring =3D $self->{DBH}->quote($datastring);
- my $sql =3D "insert into ____collision____ (rowid,
-tablename, rowver, stamp, data, reason, action, username,
-nodename, queue) values($rowid,$qtable, $rowver, $timestamp,$datastring,
-$qreason, $qaction,$quser, $qnode)";
- $self->{DBH}->do($sql) || die 'Failed to write to collision table: '.$sel=
- } else {
- $self->writeclient('collision',$rowid,$table, $rowver, $timestamp,$reason=
-, $action,$self->{user}, $self->{node}, $data);
- }
- return $reason;
-#calls get_updates once for each publication the user/node is subscribed to=
- in correct sync_order
-sub get_all_updates {
- my $self =3D shift;
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- foreach my $pub (@{$self->{orderpubs}}) {
- $self->get_updates($pub, 1); #request update as sync unless overridden b=
-y flags
- }
-# Call this once for each table the client needs refreshed or sync'ed AFTER=
- all inbound client changes have been posted
-# Accepts publication and sync flag as arguments
-sub get_updates {
- my $self =3D shift;
- my $pub =3D shift || die 'Publication is required';
- my $sync =3D shift;
- my $qpub =3D $self->{DBH}->quote($pub);
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- #enforce refresh and refreshonce flags
- undef $sync if !$self->{pubs}->{$pub};=20
- my %info =3D %{$self->{current}};
- my @cols =3D $self->GetColList("select col_name from ____subscribed_cols__=
-__ where username =3D $quser and nodename =3D $qnode and pubname =3D $qpub"=
- my ($table) =3D $self->GetOneRow("select tablename from ____publications__=
-__ where pubname =3D $qpub");
- return 'Table incorrectly registered for read' if !defined($table);
- my $qtable =3D $self->{DBH}->quote($table);=09
- my $sql =3D "select pubname, last_session, post_ver, last_ver, whereclause=
- from ____subscribed____ where username =3D $quser and pubname =3D $qpub an=
-d nodename =3D $qnode";
- my ($junk, $last_session, $post_ver, $last_ver, $whereclause) =3D $self->G=
- my ($wc) =3D $self->GetOneRow("select whereclause from ____publications___=
-_ where pubname =3D $qpub");
- $whereclause =3D '('.$whereclause.')' if $whereclause;
- $whereclause =3D $whereclause.' and ('.$wc.')' if $wc;
- if ($sync) {
- $self->writeclient('start synchronize', $pub);
- } else {
- $self->writeclient('start refresh', $pub);
- $self->{DBH}->do("update ____subscribed____ set refreshonce =3D false whe=
-re pubname =3D $qpub and username =3D $quser and nodename =3D $qnode") || r=
-eturn 'Failed to clear RefreshOnce flag: '.$self->{DBH}->errstr;
- }
- $self->writeclient('columns',@cols);
- my $sql =3D "select ____rowid____, ".join(',', @cols)." from $table";
- if ($sync) {
- $sql =3D $sql." where (____rowver____ <=3D $self->{max_ver} and ____rowve=
-r____ > $last_ver)";
- if (defined($self->{this_post_ver})) {
- $sql =3D $sql . " and (____rowver____ <> $post_ver)";
- }
- } else {
- $sql =3D $sql." where (____rowver____ <=3D $self->{max_ver})";
- }
- $sql =3D $sql." and $whereclause" if $whereclause;
- my $sth =3D $self->{DBH}->prepare($sql) || return 'Failed to get prepare S=
-QL for updates: '.$self->{DBH}->errstr;
- $sth->execute || return 'Failed to execute SQL for updates: '.$self->{DBH}=
- my @row;
- while (@row =3D $sth->fetchrow_array) {
- $self->writeclient('update/insert',@row);
- }
- $sth->finish;
- # now get deleted rows
- if ($sync) {
- $sql =3D "select rowid from ____deleted____ where (tablename =3D $qtable)=
- $sql =3D $sql." and (rowver <=3D $self->{max_ver} and rowver > $last_ver)=
- if (defined($self->{this_post_ver})) {
- $sql =3D $sql . " and (rowver <> $self->{this_post_ver})";
- }
- $sql =3D $sql." and $whereclause" if $whereclause;
- $sth =3D $self->{DBH}->prepare($sql) || return 'Failed to get prepare SQL=
- for deletes: '.$self->{DBH}->errstr;
- $sth->execute || return 'Failed to execute SQL for deletes: '.$self->{DBH=
- my @row;
- while (@row =3D $sth->fetchrow_array) {
- $self->writeclient('delete',@row);
- }
- $sth->finish;
- }
- if ($sync) {
- $self->writeclient('end synchronize', $pub);
- } else {
- $self->writeclient('end refresh', $pub);
- }
- my $qpub =3D $self->{DBH}->quote($pub);
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- $self->{DBH}->do("update ____subscribed____ set last_ver =3D $self->{max_v=
-er}, last_session =3D now(), post_ver =3D $self->{this_post_ver} where user=
-name =3D $quser and nodename =3D $qnode and pubname =3D $qpub");
- return undef;
-# Call this once when everything else is done. Does housekeeping.=20
-sub DESTROY {
- my $self =3D shift;
-#release version from lock table (including old ones)
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- my $sql =3D "delete from ____last_stable____ where username =3D $quser and=
- nodename =3D $qnode";
- $self->{DBH}->do($sql);
-#clean up deleted table
- my ($version) =3D $self->GetOneRow("select min(last_ver) from ____subscrib=
- return undef if not defined $version;
- $self->{DBH}->do("delete from ____deleted____ where rowver < $version") ||=
- return 'Failed to prune deleted table'.$self->{DBH}->errstr;;
-#disconnect from DBD sessions
- $self->{DBH}->disconnect;
- $self->{DBLOG}->disconnect;
- return undef;
-############# Helper Subs ############
-sub GetColList {
- my $self =3D shift;
- my $sql =3D shift || die 'Must provide sql select statement';
- my $sth =3D $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my $val;
- my @col;
- while (($val) =3D $sth->fetchrow_array) {
- push @col, $val;
- }
- $sth->finish;
- return @col;
-sub GetOneRow {
- my $self =3D shift;
- my $sql =3D shift || die 'Must provide sql select statement';
- my $sth =3D $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row =3D $sth->fetchrow_array;
- $sth->finish;
- return @row;
-package SyncManager;
-use DBI;
-# new requires 3 arguments: dbi connection string, plus the corresponding u=
-sername and password
-sub new {
- my $proto =3D shift;
- my $class =3D ref($proto) || $proto;
- my $self =3D {};
- my $dbi =3D shift;
- my $user =3D shift;
- my $pass =3D shift;
- $self->{DBH} =3D DBI->connect($dbi,$user,$pass) || die "Failed to connect =
-to database: ".DBI->errstr();
- $self->{DBLOG}=3D DBI->connect($dbi,$user,$pass) || die "cannot log to DB:=
- ".DBI->errstr();
- return bless ($self, $class);
-sub dblog {=20
- my $self =3D shift;
- my $msg =3D $self->{DBLOG}->quote($_[0]);
- my $quser =3D $self->{DBH}->quote($self->{user});
- my $qnode =3D $self->{DBH}->quote($self->{node});
- $self->{DBLOG}->do("insert into ____sync_log____ (username, nodename,stamp=
-, message) values($quser, $qnode, now(), $msg)");
-#this should never need to be called, but it might if a node bails without =
-releasing their locks
-sub ReleaseAllLocks {
- my $self =3D shift;
- $self->{DBH}->do("delete from ____last_stable____");
-# Adds a publication to the system. Also adds triggers, sequences, etc ass=
-ociated with the table if appropriate.
- # accepts two arguments: the name of a physical table and the name under wh=
-ich to publish it=20
- # NOTE: the publication name is optional and will default to the table na=
-me if not supplied
- # returns undef if ok, else error string;
-sub publish {
- my $self =3D shift;
- my $table =3D shift || die 'You must provide a table name (and optionally =
-a unique publication name)';
- my $pub =3D shift;
- $pub =3D $table if not defined($pub);
- my $qpub =3D $self->{DBH}->quote($pub);
- my $sql =3D "select tablename from ____publications____ where pubname =3D =
- my ($junk) =3D $self->GetOneRow($sql);
- return 'Publication already exists' if defined($junk);
- my $qtable =3D $self->{DBH}->quote($table);
- $sql =3D "select table_id, refcount from ____tables____ where tablename =
-=3D $qtable";
- my ($id, $refcount) =3D $self->GetOneRow($sql);
- if(!defined($id)) {
- $self->{DBH}->do("insert into ____tables____ (tablename, refcount) values=
- ($qtable,1)") || return 'Failed to register table: ' . $self->{DBH}->errst=
- my $sql =3D "select table_id from ____tables____ where tablename =3D $qta=
- ($id) =3D $self->GetOneRow($sql);
- }
- if (defined($refcount)) {
- $self->{DBH}->do("update ____tables____ set refcount =3D refcount+1 where=
- table_id =3D $id") || return 'Failed to update reference count: ' . $self->=
- } else {
- $id =3D '_'.$id.'_';=20
- my @cols =3D $self->GetTableCols($table, 1); # 1 =3D get hidden cols too
- my %skip;
- foreach my $col (@cols) {
- $skip{$col} =3D 1;
- }
- if (!$skip{____rowver____}) {
- $self->{DBH}->do("alter table $table add column ____rowver____ int4"); #=
-don't fail here in case table is being republished, just accept the error s=
- }
- $self->{DBH}->do("update $table set ____rowver____ =3D ____version_seq___=
-_.last_value - 1") || return 'Failed to initialize rowver: ' . $self->{DBH}=
- if (!$skip{____rowid____}) {
- $self->{DBH}->do("alter table $table add column ____rowid____ int4"); #d=
-on't fail here in case table is being republished, just accept the error si=
- }
- my $index =3D $id.'____rowid____idx';
- $self->{DBH}->do("create index $index on $table(____rowid____)") || retur=
-n 'Failed to create rowid index: ' . $self->{DBH}->errstr;
- my $sequence =3D $id.'_rowid_seq';
- $self->{DBH}->do("create sequence $sequence") || return 'Failed to create=
- rowver sequence: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("alter table $table alter column ____rowid____ set defau=
-lt nextval('$sequence')"); #don't fail here in case table is being republis=
-hed, just accept the error silently
- $self->{DBH}->do("update $table set ____rowid____ =3D nextval('$sequence=
-')") || return 'Failed to initialize rowid: ' . $self->{DBH}->errstr;
- if (!$skip{____stamp____}) {
- $self->{DBH}->do("alter table $table add column ____stamp____ timestamp"=
-); #don't fail here in case table is being republished, just accept the err=
-or silently
- }
- $self->{DBH}->do("update $table set ____stamp____ =3D now()") || return =
-'Failed to initialize stamp: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_ver_ins';
- $self->{DBH}->do("create trigger $trigger before insert on $table for eac=
-h row execute procedure sync_insert_ver()") || return 'Failed to create tri=
-gger: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_ver_upd';
- $self->{DBH}->do("create trigger $trigger before update on $table for eac=
-h row execute procedure sync_update_ver()") || return 'Failed to create tri=
-gger: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_del_row';
- $self->{DBH}->do("create trigger $trigger after delete on $table for each=
- row execute procedure sync_delete_row()") || return 'Failed to create trig=
-ger: ' . $self->{DBH}->errstr;
- }
- $self->{DBH}->do("insert into ____publications____ (pubname, tablename) va=
-lues ('$pub','$table')") || return 'Failed to create publication entry: '.$=
- return undef;
-# Removes a publication from the system. Also drops triggers, sequences, e=
-tc associated with the table if appropriate.
- # accepts one argument: the name of a publication
- # returns undef if ok, else error string;
-sub unpublish {
- my $self =3D shift;
- my $pub =3D shift || return 'You must provide a publication name';
- my $qpub =3D $self->{DBH}->quote($pub);
- my $sql =3D "select tablename from ____publications____ where pubname =3D =
- my ($table) =3D $self->GetOneRow($sql);
- return 'Publication does not exist' if !defined($table);
- my $qtable =3D $self->{DBH}->quote($table);
- $sql =3D "select table_id, refcount from ____tables____ where tablename =
-=3D $qtable";
- my ($id, $refcount) =3D $self->GetOneRow($sql);
- return "Table: $table is not correctly registered!" if not defined($id);
- $self->{DBH}->do("update ____tables____ set refcount =3D refcount -1 where=
- tablename =3D $qtable") || return 'Failed to decrement reference count: ' =
-. $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____subscribed____ where pubname =3D $qpub")=
- || return 'Failed to delete user subscriptions: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____subscribed_cols____ where pubname =3D $q=
-pub") || return 'Failed to delete subscribed columns: ' . $self->{DBH}->err=
- $self->{DBH}->do("delete from ____publications____ where tablename =3D $qt=
-able and pubname =3D $qpub") || return 'Failed to delete from publications:=
- ' . $self->{DBH}->errstr;
- #if this is the last reference, we want to drop triggers, etc;
- if ($refcount <=3D 1) {
- $id =3D "_".$id."_";
- $self->{DBH}->do("alter table $table alter column ____rowver____ drop def=
-ault") || return 'Failed to alter column default: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("alter table $table alter column ____rowid____ drop defa=
-ult") || return 'Failed to alter column default: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("alter table $table alter column ____stamp____ drop defa=
-ult") || return 'Failed to alter column default: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_ver_upd';
- $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to =
-drop trigger: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_ver_ins';
- $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to =
-drop trigger: ' . $self->{DBH}->errstr;
- my $trigger =3D $id.'_del_row';
- $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to =
-drop trigger: ' . $self->{DBH}->errstr;
- my $sequence =3D $id.'_rowid_seq';
- $self->{DBH}->do("drop sequence $sequence") || return 'Failed to drop seq=
-uence: ' . $self->{DBH}->errstr;
- my $index =3D $id.'____rowid____idx';
- $self->{DBH}->do("drop index $index") || return 'Failed to drop index: ' =
-. $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____tables____ where tablename =3D $qtable"=
-) || return 'remove entry from tables: ' . $self->{DBH}->errstr;
- }
-return undef;
-#Subscribe user/node to a publication
- # Accepts 3 arguments: Username, Nodename, Publication
- # NOTE: the remaining arguments can be supplied as column names to which =
-the user/node should be subscribed
- # Return undef if ok, else returns an error string
-sub subscribe {
- my $self =3D shift;
- my $user =3D shift || die 'You must provide user, node and publication as =
- my $node =3D shift || die 'You must provide user, node and publication as =
- my $pub =3D shift || die 'You must provide user, node and publication as a=
- my @cols =3D @_;
- my $quser =3D $self->{DBH}->quote($user);
- my $qnode =3D $self->{DBH}->quote($node);
- my $qpub =3D $self->{DBH}->quote($pub);
- my $sql =3D "select tablename from ____publications____ where pubname =3D =
- my ($table) =3D $self->GetOneRow($sql);
- return "Publication $pub does not exist." if not defined $table;
- my $qtable =3D $self->{DBH}->quote($table);
- @cols =3D $self->GetTableCols($table) if !@cols; # get defaults if cols we=
-re not specified by caller
- $self->{DBH}->do("insert into ____subscribed____ (username, nodename,pubna=
-me,last_ver,refreshonce) values('$user', '$node','$pub',0, true)") || retur=
-n 'Failed to create subscription: ' . $self->{DBH}->errstr;=09
- foreach my $col (@cols) {
- $self->{DBH}->do("insert into ____subscribed_cols____ (username, nodename=
-, pubname, col_name) values ('$user','$node','$pub','$col')") || return 'Fa=
-iled to subscribe column: ' . $self->{DBH}->errstr;=09
- }
- return undef;
-#Unsubscribe user/node to a publication
- # Accepts 3 arguments: Username, Nodename, Publication
- # Return undef if ok, else returns an error string
-sub unsubscribe {
- my $self =3D shift;
- my $user =3D shift || die 'You must provide user, node and publication as =
- my $node =3D shift || die 'You must provide user, node and publication as =
- my $pub =3D shift || die 'You must provide user, node and publication as a=
- my @cols =3D @_;
- my $quser =3D $self->{DBH}->quote($user);
- my $qnode =3D $self->{DBH}->quote($node);
- my $qpub =3D $self->{DBH}->quote($pub);
- my $sql =3D "select tablename from ____publications____ where pubname =3D =
- my ($table) =3D $self->GetOneRow($sql);
- return "Publication $pub does not exist." if not defined $table;
- $self->{DBH}->do("delete from ____subscribed_cols____ where pubname =3D $q=
-pub and username =3D $quser and nodename =3D $qnode") || return 'Failed to =
-remove column subscription: '. $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____subscribed____ where pubname =3D $qpub a=
-nd username =3D $quser and nodename =3D $qnode") || return 'Failed to remov=
-e subscription: '. $self->{DBH}->errstr;
- return undef;
-#INSTALL creates the necessary management tables.=20=20
- #returns undef if everything is ok, else returns a string describing the e=
-sub INSTALL {
-my $self =3D shift;
-#check to see if management tables are already installed
-my ($test) =3D $self->GetOneRow("select * from pg_class where relname =3D '=
-if (defined($test)) {
- return 'It appears that synchronization management tables are already ins=
-talled here. Please uninstall before reinstalling.';
-#install the management tables, etc.
-$self->{DBH}->do("create table ____publications____ (pubname text primary k=
-ey,description text, tablename text, sync_order int4, whereclause text)") |=
-| return $self->{DBH}->errstr();
-$self->{DBH}->do("create table ____subscribed_cols____ (nodename text, user=
-name text, pubname text, col_name text, description text, primary key(noden=
-ame, username, pubname,col_name))") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create table ____subscribed____ (nodename text, username =
-text, pubname text, last_session timestamp, post_ver int4, last_ver int4, w=
-hereclause text, sanity_limit int4 default 0, sanity_delete int4 default 0,=
- sanity_update int4 default 0, sanity_insert int4 default 50, readonly bool=
-ean, disabled boolean, fullrefreshonly boolean, refreshonce boolean, primar=
-y key(nodename, username, pubname))") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create table ____last_stable____ (version int4, username =
-text, nodename text, primary key(version, username, nodename))") || return =
-$self->{DBH}->do("create table ____tables____ (tablename text, table_id int=
-4, refcount int4, primary key(tablename, table_id))") || return $self->{DBH=
-$self->{DBH}->do("create sequence ____table_id_seq____") || return $self->{=
-$self->{DBH}->do("alter table ____tables____ alter column table_id set defa=
-ult nextval('____table_id_seq____')") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create table ____deleted____ (rowid int4, tablename text,=
- rowver int4, stamp timestamp, primary key (rowid, tablename))") || return =
-$self->{DBH}->do("create table ____collision____ (rowid text, tablename tex=
-t, rowver int4, stamp timestamp, faildate timestamp default now(),data text=
-,reason text, action text, username text, nodename text,queue text)") || re=
-turn $self->{DBH}->errstr();
-$self->{DBH}->do("create sequence ____version_seq____") || return $self->{D=
-$self->{DBH}->do("create table ____sync_log____ (username text, nodename te=
-xt, stamp timestamp, message text)") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create function sync_insert_ver() returns opaque as
-if new.____rowver____ isnull then
-new.____rowver____ :=3D ____version_seq____.last_value;
-end if;
-if new.____stamp____ isnull then
-new.____stamp____ :=3D now();
-end if;
-return NEW;
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create function sync_update_ver() returns opaque as
-if new.____rowver____ =3D old.____rowver____ then
-new.____rowver____ :=3D ____version_seq____.last_value;
-end if;
-if new.____stamp____ =3D old.____stamp____ then
-new.____stamp____ :=3D now();
-end if;
-return NEW;
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-$self->{DBH}->do("create function sync_delete_row() returns opaque as=20
-insert into ____deleted____ (rowid,tablename,rowver,stamp) values
-(old.____rowid____, TG_RELNAME, old.____rowver____,old.____stamp____);=20
-return old;=20
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-return undef;
-#removes all management tables & related stuff
- #returns undef if ok, else returns an error message as a string
-my $self =3D shift;
-#Make sure all tables are unpublished first
-my $sth =3D $self->{DBH}->prepare("select pubname from ____publications____=
-my $pub;
-while (($pub) =3D $sth->fetchrow_array) {
- $self->unpublish($pub);=09
-$self->{DBH}->do("drop table ____publications____") || return $self->{DBH}-=
-$self->{DBH}->do("drop table ____subscribed_cols____") || return $self->{DB=
-$self->{DBH}->do("drop table ____subscribed____") || return $self->{DBH}->e=
-$self->{DBH}->do("drop table ____last_stable____") || return $self->{DBH}->=
-$self->{DBH}->do("drop table ____deleted____") || return $self->{DBH}->errs=
-$self->{DBH}->do("drop table ____collision____") || return $self->{DBH}->er=
-$self->{DBH}->do("drop table ____tables____") || return $self->{DBH}->errst=
-$self->{DBH}->do("drop table ____sync_log____") || return $self->{DBH}->err=
-$self->{DBH}->do("drop sequence ____table_id_seq____") || return $self->{DB=
-$self->{DBH}->do("drop sequence ____version_seq____") || return $self->{DBH=
-$self->{DBH}->do("drop function sync_insert_ver()") || return $self->{DBH}-=
-$self->{DBH}->do("drop function sync_update_ver()") || return $self->{DBH}-=
-$self->{DBH}->do("drop function sync_delete_row()") || return $self->{DBH}-=
-return undef;
-sub DESTROY {
- my $self =3D shift;
- $self->{DBH}->disconnect;
- $self->{DBLOG}->disconnect;
- return undef;
-############# Helper Subs ############
-sub GetOneRow {
- my $self =3D shift;
- my $sql =3D shift || die 'Must provide sql select statement';
- my $sth =3D $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row =3D $sth->fetchrow_array;
- $sth->finish;
- return @row;
-#call this with second non-zero value to get hidden columns
-sub GetTableCols {
- my $self =3D shift;
- my $table =3D shift || die 'Must provide table name';
- my $wanthidden =3D shift;
- my $sql =3D "select * from $table where 0 =3D 1";
- my $sth =3D $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row =3D @{$sth->{NAME}};
- $sth->finish;
- return @row if $wanthidden;
- my @cols;
- foreach my $col (@row) {
- next if $col eq '____rowver____';
- next if $col eq '____stamp____';
- next if $col eq '____rowid____';
- push @cols, $col;=09
- }
- return @cols;
-1; #happy require
-From pgsql-hackers-owner+M9917@postgresql.org Mon Jun 11 15:53:25 2001
-Return-path: <pgsql-hackers-owner+M9917@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BJrPL01206
- for <pgman@candle.pha.pa.us>; Mon, 11 Jun 2001 15:53:25 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5BJrPE67753;
- Mon, 11 Jun 2001 15:53:25 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9917@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BJmLE65620
- for <pgsql-hackers@postgresql.org>; Mon, 11 Jun 2001 15:48:21 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5BJm2Q28847
- for <pgsql-hackers@postgresql.org>; Mon, 11 Jun 2001 15:48:02 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-Date: Mon, 11 Jun 2001 19:46:44 GMT
-Message-ID: <20010611.19464400@j2.us.greatbridge.com>
-Subject: [HACKERS] Postgres Replication
-To: pgsql-hackers@postgresql.org
-Reply-To: Darren Johnson <djohnson@greatbridge.com>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5BJmLE65621
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-We have been researching replication for several months now, and
-I have some opinions to share with the community for feedback,
-discussion, and/or participation. Our goal is to get a replication
-solution for PostgreSQL that will meet most needs of users
-and applications alike (mission impossible theme here :).
-My research work, along with that of other contributors, has been collected
-and presented here: http://www.greatbridge.org/genpage?replication_top
-If there is something missing, especially PostgreSQL related
-work, I would like to know about it, and my apologies to anyone
-who got left off the list. This work is ongoing and doesn't
-draw a conclusion, which IMHO should be left up to the user,
-but I'm offering my opinions to spur discussion and/or feedback
-from this list, and I will try not to offend anyone.
-Here's my opinion: of the approaches we've surveyed, the most
-promising one is the Postgres-R project from the Information and
-Communication Systems Group, ETH in Zurich, Switzerland, originally
-produced by Bettina Kemme, Gustavo Alonso, and others. Although
-Postgres-R is a synchronous approach, I believe it is the closest to
-the goal mentioned above. Here is an abstract of the advantages.
-1) Postgres-R is built on the PostgreSQL-6.4.2 code base. The
-functionality is an optional parameter, so there will be insignificant
-overhead for non-replication situations. The replication and
-communication managers are the two new modules added to the PostgreSQL code base.
-2) The replication manager's main function is controlling the
-replication protocol via a message handling process. It receives
-messages from the local and remote backends and forwards write
-sets and decision messages via the communication manager to the
-other servers. The replication manager controls all the transactions
-running on the local server by keeping track of the states, including
-which protocol phase (read, send, lock, or write) the transaction is
-in. The replication manager maintains a two way channel
-implemented as buffered sockets to each backend.
-3) The main task of the communication manager is to provide a simple
-socket-based interface between the replication manager and the
-group communication system (currently Ensemble). The
-communication system is a cluster of servers connected via
-the communication manager. The replication manager also maintains
-three one-way channels to the communication system: a broadcast
-channel to send messages, a total-order channel to receive
-totally ordered write sets, and a no-order channel to listen for
-decision messages from the communication system. Decision
-messages can be received at any time, whereas the reception of
-totally ordered write sets can be blocked in certain phases.
-4) Based on a two-phase locking approach, all deadlock situations
-are local, detectable by the Postgres-R code base, and aborted.
-5) The write set messages used to send database changes to other
-servers can use either the SQL statements or the actual tuples
-changed. This is a parameter based on the number of tuples changed
-by a transaction. While sending the tuple changes reduces
-overhead in query parse, plan and execution, there is a negative
-effect in sending a large write set across the network.
-6) Postgres-R uses a synchronous approach that keeps the data on
-all sites consistent and provides serializability. The user does not
-have to bother with conflict resolution, and receives the same
-correctness and consistency of a centralized system.
-7) Postgres-R could be part of a good fault-resilient and load
-solution. It is peer-to-peer based and incurs low overhead propagating
-updates to the other cluster members. All replicated databases locally
-process queries.
-8) Compared to other synchronous replication strategies (e.g., standard
-distributed 2-phase-locking + 2-phase-commit), Postgres-R has much
-better performance using 2-phase-locking.
-There are some issues that are not currently addressed by
-Postgres-R, but some enhancements made to PostgreSQL since the
-6.4.2 tree are very favorable to addressing these shortcomings.
-1) The addition of WAL in 7.1 provides the information for recovering
-failed/off-line servers; currently all the servers would have to be
-stopped, and a copy would be used to get all the servers synchronized
-before starting again.
-2) Being synchronous, Postgres-R would not be a good solution
-for off-line/WAN scenarios where asynchronous replication is
-required. There are some theories on this issue which involve servers
-connecting and disconnecting from the cluster.
-3) As in any serialized synchronous approach, there is a change in the
-flow of execution of a transaction; while most of these changes can
-be handled by calling newly developed functions at certain time points,
-synchronous replica control is tightly coupled with concurrency control.
-Hence, especially in PostgreSQL 7.2, some parts of the concurrency control
-(MVCC) might have to be adjusted. This can lead to slightly more
-complicated maintenance than in a system that does not change the backend.
-4) Partial replication is not addressed.
-Any feedback on this post will be appreciated.
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M9923@postgresql.org Mon Jun 11 18:14:23 2001
-Return-path: <pgsql-hackers-owner+M9923@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BMENL18644
- for <pgman@candle.pha.pa.us>; Mon, 11 Jun 2001 18:14:23 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5BMEQE14877;
- Mon, 11 Jun 2001 18:14:26 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9923@postgresql.org)
-Received: from spoetnik.xs4all.nl (spoetnik.xs4all.nl [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BM6ME12270
- for <pgsql-hackers@postgresql.org>; Mon, 11 Jun 2001 18:06:23 -0400 (EDT)
- (envelope-from reinoud@xs4all.nl)
-Received: from KAYAK (kayak [])
- by spoetnik.xs4all.nl (Postfix) with SMTP id 865A33E1B
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 00:06:16 +0200 (CEST)
-From: reinoud@xs4all.nl (Reinoud van Leeuwen)
-To: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Postgres Replication
-Date: Mon, 11 Jun 2001 22:06:07 GMT
-Organization: Not organized in any way
-Reply-To: reinoud@xs4all.nl
-Message-ID: <3b403d96.562404297@>
-References: <20010611.19464400@j2.us.greatbridge.com>
-In-Reply-To: <20010611.19464400@j2.us.greatbridge.com>
-X-Mailer: Forte Agent 1.5/32.451
-MIME-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5BM6PE12276
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Mon, 11 Jun 2001 19:46:44 GMT, you wrote:
->We have been researching replication for several months now, and
->I have some opinions to share with the community for feedback,
->discussion, and/or participation. Our goal is to get a replication
->solution for PostgreSQL that will meet most needs of users
->and applications alike (mission impossible theme here :).
->My research work along with other contributors has been collected
->and presented here http://www.greatbridge.org/genpage?replication_top
->If there is something missing, especially PostgreSQL related
->work, I would like to know about it, and my apologies to anyone
->who got left off the list. This work is ongoing and doesn't
->draw a conclusion, which IMHO should be left up to the user,
->but I'm offering my opinions to spur discussion and/or feedback
->from this list, and try not to offend anyone.
->Here's my opinion: of the approaches we've surveyed, the most
->promising one is the Postgres-R project from the Information and
->Communication Systems Group, ETH in Zurich, Switzerland, originally
->produced by Bettina Kemme, Gustavo Alonso, and others. Although
->Postgres-R is a synchronous approach, I believe it is the closest to
->the goal mentioned above. Here is an abstract of the advantages.
->1) Postgres-R is built on the PostgreSQL-6.4.2 code base. The
->functionality is an optional parameter, so there will be insignificant
->overhead for non replication situations. The replication and
->communication managers are the two new modules added to the PostgreSQL code base.
->2) The replication manager's main function is controlling the
->replication protocol via a message handling process. It receives
->messages from the local and remote backends and forwards write
->sets and decision messages via the communication manager to the
->other servers. The replication manager controls all the transactions
->running on the local server by keeping track of the states, including
->which protocol phase (read, send, lock, or write) the transaction is
->in. The replication manager maintains a two way channel
->implemented as buffered sockets to each backend.
-what does "manager controls all the transactions" mean? I hope it does
-*not* mean that a bug in the manager would cause transactions not to commit...
->3) The main task of the communication manager is to provide simple
->socket based interface between the replication manager and the
->group communication system (currently Ensemble). The
->communication system is a cluster of servers connected via
->the communication manager. The replication manager also maintains
->three one-way channels to the communication system: a broadcast
->channel to send messages, a total-order channel to receive
->totally ordered write sets, and a no-order channel to listen for
->decision messages from the communication system. Decision
->messages can be received at any time where the reception of
->totally ordered write sets can be blocked in certain phases.
->4) Based on a two phase locking approach, all dead lock situations
->are local and detectable by Postgres-R code base, and aborted.
-Does this imply locking over different servers? That would mean a
-grinding halt when a network outage occurs...
->5) The write set messages used to send database changes to other
->servers, can use either the SQL statements or the actual tuples
->changed. This is a parameter based on number of tuples changed
->by a transaction. While sending the tuple changes reduces
->overhead in query parse, plan and execution, there is a negative
->effect in sending a large write set across the network.
->6) Postgres-R uses a synchronous approach that keeps the data on
->all sites consistent and provides serializability. The user does not
->have to bother with conflict resolution, and receives the same
->correctness and consistency of a centralized system.
->7) Postgres-R could be part of a good fault-resilient and load
->solution. It is peer-to-peer based and incurs low overhead propagating
->updates to the other cluster members. All replicated databases locally
->process queries.
->8) Compared to other synchronous replication strategies (e.g., standard
->distributed 2-phase-locking + 2-phase-commit), Postgres-R has much
->better performance using 2-phase-locking.
-Coming from a Sybase background I have some experience with
-replication. The way it works in Sybase Replication server is as follows:
-- for each replicated database, there is a "log reader" process that
-reads the WAL and captures only *committed transactions* to the
-replication server. (it does not make much sense to replicate other
-things IMHO :-).
-- the replication server stores incoming data in a queue ("stable
-device"), until it is sure it has reached its final destination
-- a replication server can send data to another replication server in
-a compact (read: WAN friendly) way. A chain of replication servers can
-be made, depending on network architecture.
-- the final replication server makes an almost standard client
-connection to the target database and translates the compact
-transactions back to SQL statements. By using masks, extra
-functionality can be built in.
-This kind of architecture has several advantages:
-- only committed transactions are replicated which saves overhead
-- it does not have very much impact on performance of the source
-server (apart from reading the WAL)
-- since every replication server has a stable device, data is stored
-when the network is down and nothing gets lost (nor stops performing)
-- because only the log reader and the connection from the final
-replication server are RDBMS specific, it is possible to replicate
-from MS to Oracle using a Sybase replication server (or different
-versions etc).
-I do not know how much of this is patented or copyrighted, but the
-architecture seems elegant and robust to me. I have done
-implementations of bi-directional replication too. It *is* possible
-but does require some funky setup and maintenance. (but it is better
-than letting offices on different continents work on the same
-database :-)
-just my 2 EURO cts :-)
-"Nothing is as subjective as reality"
-Reinoud van Leeuwen reinoud@xs4all.nl
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M9924@postgresql.org Mon Jun 11 18:41:51 2001
-Return-path: <pgsql-hackers-owner+M9924@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BMfpL28917
- for <pgman@candle.pha.pa.us>; Mon, 11 Jun 2001 18:41:51 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5BMfsE25092;
- Mon, 11 Jun 2001 18:41:54 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9924@postgresql.org)
-Received: from spider.pilosoft.com (p55-222.acedsl.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BMalE23024
- for <pgsql-hackers@postgresql.org>; Mon, 11 Jun 2001 18:36:47 -0400 (EDT)
- (envelope-from alex@pilosoft.com)
-Received: from localhost (alexmail@localhost)
- by spider.pilosoft.com (8.9.3/8.9.3) with ESMTP id SAA06092;
- Mon, 11 Jun 2001 18:46:05 -0400 (EDT)
-Date: Mon, 11 Jun 2001 18:46:05 -0400 (EDT)
-From: Alex Pilosov <alex@pilosoft.com>
-To: Reinoud van Leeuwen <reinoud@xs4all.nl>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Postgres Replication
-In-Reply-To: <3b403d96.562404297@>
-Message-ID: <Pine.BSO.4.10.10106111828450.9902-100000@spider.pilosoft.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Mon, 11 Jun 2001, Reinoud van Leeuwen wrote:
-> On Mon, 11 Jun 2001 19:46:44 GMT, you wrote:
-> what does "manager controls all the transactions" mean? I hope it does
-> *not* mean that a bug in the manager would cause transactions not to
-> commit...
-Well yeah it does. Bugs are a fact of life. :)
-> >4) Based on a two phase locking approach, all dead lock situations
-> >are local and detectable by Postgres-R code base, and aborted.
-> Does this imply locking over different servers? That would mean a
-> grinding halt when a network outage occurs...
-Don't know, but see below.
-> Coming from a Sybase background I have some experience with
-> replication. The way it works in Sybase Replication server is as
-> follows:
-> - for each replicated database, there is a "log reader" process that
-> reads the WAL and captures only *committed transactions* to the
-> replication server. (it does not make much sense to replicate other
-> things IMHO :-).
-> - the replication server stores incoming data in a queue ("stable
-> device"), until it is sure it has reached its final destination
-> - a replication server can send data to another replication server in
-> a compact (read: WAN friendly) way. A chain of replication servers can
-> be made, depending on network architecture.
-> - the final replication server makes an almost standard client
-> connection to the target database and translates the compact
-> transactions back to SQL statements. By using masks, extra
-> functionality can be built in.
-> This kind of architecture has several advantages:
-> - only committed transactions are replicated which saves overhead
-> - it does not have very much impact on performance of the source
-> server (apart from reading the WAL)
-> - since every replication server has a stable device, data is stored
-> when the network is down and nothing gets lost (nor stops performing)
-> - because only the log reader and the connection from the final
-> replication server are RDBMS specific, it is possible to replicate
-> from MS to Oracle using a Sybase replication server (or different
-> versions etc).
-> I do not know how much of this is patented or copyrighted, but the
-> architecture seems elegant and robust to me. I have done
-> implementations of bi-directional replication too. It *is* possible
-> but does require some funky setup and maintenance. (but it is better
-> than letting offices on different continents work on the same
-> database :-)
-Yes, the above architecture is what almost every vendor of replication
-software uses. And I'm sure if you worked much with Sybase, you hate the
-garbage that their repserver is :).
-The architectures of postgres-r and repserver are fundamentally different
-for a good reason: repserver only wants to replicate committed
-transactions, while postgres-r is more of a 'clustering' solution (albeit
-they don't say this word), and is capable of doing much more than simple rep
-I.e., you can safely put half of your clients on a second server in a
-replicated postgres-r cluster without being worried that a conflict (or a
-weird locking situation) may occur.
-Try that with sybase, it is fundamentally designed for one-way
-replication, and the fact that you can do one-way replication in both
-directions doesn't mean it's safe to do that!
-I'm not sure how postgres-r handles network problems. To be useful, a good
-replication solution must have an option of "no network->no updates" as
-well as "no network->queue updates and send them later". However, it is
-far easier to add queuing to a correct 'eager locking' database than it is
-to add proper locking to a queue-based replicator.
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M9932@postgresql.org Mon Jun 11 22:17:54 2001
-Return-path: <pgsql-hackers-owner+M9932@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C2HsL15803
- for <pgman@candle.pha.pa.us>; Mon, 11 Jun 2001 22:17:54 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5C2HtE86836;
- Mon, 11 Jun 2001 22:17:55 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9932@postgresql.org)
-Received: from femail15.sdc1.sfba.home.com (femail15.sdc1.sfba.home.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C2BXE85020
- for <pgsql-hackers@postgresql.org>; Mon, 11 Jun 2001 22:11:33 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from greatbridge.com ([])
- by femail15.sdc1.sfba.home.com
- (InterMail vM. 201-229-121-120-20010223) with ESMTP
- id <20010612021124.OZRG17243.femail15.sdc1.sfba.home.com@greatbridge.com>;
- Mon, 11 Jun 2001 19:11:24 -0700
-Message-ID: <3B257969.6050405@greatbridge.com>
-Date: Mon, 11 Jun 2001 22:07:37 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; m18) Gecko/20001108 Netscape6/6.0
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Alex Pilosov <alex@pilosoft.com>, Reinoud van Leeuwen <reinoud@xs4all.nl>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Postgres Replication
-References: <Pine.BSO.4.10.10106111828450.9902-100000@spider.pilosoft.com>
-Content-Type: text/plain; charset=us-ascii; format=flowed
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-Thanks for the feedback. I'll try to address both your issues here.
->> what does "manager controls all the transactions" mean?
-The replication manager controls the transactions by serializing the
-write set messages.
-This ensures all transactions are committed in the same order on each
-server, so bugs
-here are not allowed ;-)
->> I hope it does
->> *not* mean that a bug in the manager would cause transactions not to
->> commit...
-> Well yeah it does. Bugs are a fact of life. :)
->>> 4) Based on a two phase locking approach, all dead lock situations
->>> are local and detectable by Postgres-R code base, and aborted.
->> Does this imply locking over different servers? That would mean a
->> grinding halt when a network outage occurs...
-> Don't know, but see below.
-There is a branch of the Postgres-R code that has some failure detection
-so we will have to merge this functionality with the version of
-Postgres-R we have, and
-test this issue. I'll let you know the results.
->> - the replication server stores incoming data in a queue ("stable
->> device"), until it is sure it has reached its final destination
-I like this idea for recovering servers that have been down a short
-period of time, using WAL
-to recover transactions missed during the outage.
->> This kind of architecture has several advantages:
->> - only committed transactions are replicated which saves overhead
->> - it does not have very much impact on performance of the source
->> server (apart from reading the WAL)
->> - since every replication server has a stable device, data is stored
->> when the network is down and nothing gets lost (nor stops performing)
->> - because only the log reader and the connection from the final
->> replication server are RDBMS specific, it is possible to replicate
->> from MS to Oracle using a Sybase replication server (or different
->> versions etc).
-There are some issues with the "log reader" approach:
-1) The databases are not synchronized until the log reader completes its
-2) I'm not sure about Sybase, but the log reader sends SQL statements to
-the other servers
-which are then parsed, planned and executed. This overhead could be
-avoided if only
-the tuple changes are replicated.
-3) Works fine for read only situations, but peer-to-peer applications
-using this approach
-must be designed with a conflict resolution scheme.
-Don't get me wrong, I believe we can learn from the replication
-techniques used by commercial
-databases like Sybase, and try to implement the good ones into
-PostgreSQL. Postgres-R is
-a synchronous approach which outperforms the traditional approaches to
-synchronous replication.
-Being based on PostgreSQL-6.4.2, getting this approach in the 7.2 tree
-might be better than
-reinventing the wheel.
-Thanks again,
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-From pgsql-hackers-owner+M9936@postgresql.org Tue Jun 12 03:22:51 2001
-Return-path: <pgsql-hackers-owner+M9936@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C7MoL11061
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 03:22:50 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5C7MPE35441;
- Tue, 12 Jun 2001 03:22:25 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9936@postgresql.org)
-Received: from reorxrsm.server.lan.at (zep3.it-austria.net [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C72ZE25009
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 03:02:36 -0400 (EDT)
- (envelope-from ZeugswetterA@wien.spardat.at)
-Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [])
- by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5C72Qu27966
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 09:02:26 +0200
-Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
- id <M3L15341>; Tue, 12 Jun 2001 09:02:21 +0200
-Message-ID: <11C1E6749A55D411A9670001FA68796336831B@sdexcsrv1.f000.d0188.sd.spardat.at>
-From: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-To: "'Darren Johnson'" <djohnson@greatbridge.com>,
- pgsql-hackers@postgresql.org
-Subject: AW: [HACKERS] Postgres Replication
-Date: Tue, 12 Jun 2001 09:02:20 +0200
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2650.21)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Although
-> Postgres-R is a synchronous approach, I believe it is the closest to
-> the goal mentioned above. Here is an abstract of the advantages.
-If you only want synchronous replication, why not simply use triggers ?
-All you would then need is remote query access and two phase commit,
-and maybe a little script that helps create the appropriate triggers.
-Doing a replicate all or nothing approach that only works synchronous
-is imho not flexible enough.
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-From pgsql-hackers-owner+M9945@postgresql.org Tue Jun 12 10:18:29 2001
-Return-path: <pgsql-hackers-owner+M9945@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CEISL06372
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 10:18:28 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CEIQE77517;
- Tue, 12 Jun 2001 10:18:26 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9945@postgresql.org)
-Received: from krypton.netropolis.org ([])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CEDuE75514
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 10:13:56 -0400 (EDT)
- (envelope-from root@generalogic.com)
-Received: from [] (helo=localhost)
- by krypton.netropolis.org with esmtp (Exim 3.12 #1 (Debian))
- id 159ouq-0003MU-00
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 10:13:08 -0400
-To: pgsql-hackers@postgresql.org
-Subject: Re: AW: [HACKERS] Postgres Replication
-In-Reply-To: <20010612.13321600@j2.us.greatbridge.com>
-References: <Pine.BSF.4.33.0106120605130.411-100000@mobile.hub.org>
- <20010612.13321600@j2.us.greatbridge.com>
-X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.0 (HANANOEN)
-MIME-Version: 1.0
-Content-Type: Text/Plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Message-ID: <20010612123623O.root@generalogic.com>
-Date: Tue, 12 Jun 2001 12:36:23 +0530
-From: root <root@generalogic.com>
-X-Dispatcher: imput version 20000414(IM141)
-Lines: 47
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I have hacked up a replication layer for Perl code accessing a
-database through the DBI interface. It works pretty well with MySQL
-(I can run pre-bender slashcode replicated, haven't tried the more
-recent releases).
-Potentially this hack should also work with Pg but I haven't tried
-yet. If someone would like to test it out with a complex Pg app and
-let me know how it went that would be cool.
-The replication layer is based on Eric Newton's Recall replication
-library (www.fault-tolerant.org/recall), and requires that all
-database accesses be through the DBI interface.
-The replicas are live, in that every operation affects all the
-replicas in real time. Replica outages are invisible to the user, so
-long as a majority of the replicas are functioning. Disconnected
-replicas can be used for read-only access.
-The only code modification that should be required to use the
-replication layer is to change the DSN in connect():
- my $replicas = ',,';
- my $dbh = DBI->connect("DBI:Recall:database=$replicas");
-You should be able to install the replication modules with:
-perl -MCPAN -eshell
-cpan> install Replication::Recall::DBServer
-and then install DBD::Recall (which doesn't seem to be accessible from
-the CPAN shell yet, for some reason), by:
-wget http://www.cpan.org/authors/id/AGUL/DBD-Recall-1.10.tar.gz
-tar xzvf DBD-Recall-1.10.tar.gz
-cd DBD-Recall-1.10
-perl Makefile.PL
-make install
-I would be very interested in hearing about your experiences with
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M9938@postgresql.org Tue Jun 12 05:12:54 2001
-Return-path: <pgsql-hackers-owner+M9938@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C9CrL15228
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 05:12:53 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5C9CnE91297;
- Tue, 12 Jun 2001 05:12:49 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9938@postgresql.org)
-Received: from mobile.hub.org (SHW39-29.accesscable.net [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C98DE89175
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 05:08:13 -0400 (EDT)
- (envelope-from scrappy@hub.org)
-Received: from localhost (scrappy@localhost)
- by mobile.hub.org (8.11.3/8.11.1) with ESMTP id f5C97f361630;
- Tue, 12 Jun 2001 06:07:46 -0300 (ADT)
- (envelope-from scrappy@hub.org)
-X-Authentication-Warning: mobile.hub.org: scrappy owned process doing -bs
-Date: Tue, 12 Jun 2001 06:07:41 -0300 (ADT)
-From: The Hermit Hacker <scrappy@hub.org>
-To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-cc: "'Darren Johnson'" <djohnson@greatbridge.com>,
- <pgsql-hackers@postgresql.org>
-Subject: Re: AW: [HACKERS] Postgres Replication
-In-Reply-To: <11C1E6749A55D411A9670001FA68796336831B@sdexcsrv1.f000.d0188.sd.spardat.at>
-Message-ID: <Pine.BSF.4.33.0106120605130.411-100000@mobile.hub.org>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-which I believe is what the rserv implementation in contrib currently does
-... no?
-It's funny ... what is in contrib right now was developed in a weekend by
-Vadim, put in contrib, yet nobody has either used it *or* seen fit to
-submit patches to improve it ... ?
-On Tue, 12 Jun 2001, Zeugswetter Andreas SB wrote:
-> > Although
-> > Postgres-R is a synchronous approach, I believe it is the closest to
-> > the goal mentioned above. Here is an abstract of the advantages.
-> If you only want synchronous replication, why not simply use triggers ?
-> All you would then need is remote query access and two phase commit,
-> and maybe a little script that helps create the appropriate triggers.
-> Doing a replicate all or nothing approach that only works synchronous
-> is imho not flexible enough.
-> Andreas
-> ---------------------------(end of broadcast)---------------------------
-> TIP 6: Have you searched our list archives?
-> http://www.postgresql.org/search.mpl
-Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
-Systems Administrator @ hub.org
-primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M9940@postgresql.org Tue Jun 12 09:39:08 2001
-Return-path: <pgsql-hackers-owner+M9940@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CDd8L03200
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 09:39:08 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CDcmE58175;
- Tue, 12 Jun 2001 09:38:48 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9940@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CDYAE56164
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 09:34:10 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CDXeQ03585;
- Tue, 12 Jun 2001 09:33:40 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-Date: Tue, 12 Jun 2001 13:32:16 GMT
-Message-ID: <20010612.13321600@j2.us.greatbridge.com>
-Subject: Re: AW: [HACKERS] Postgres Replication
-To: The Hermit Hacker <scrappy@hub.org>
-cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
- <pgsql-hackers@postgresql.org>
-Reply-To: Darren Johnson <djohnson@greatbridge.com>
-In-Reply-To: <Pine.BSF.4.33.0106120605130.411-100000@mobile.hub.org>
-References: <Pine.BSF.4.33.0106120605130.411-100000@mobile.hub.org>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CDYAE56166
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> which I believe is what the rserv implementation in contrib currently does
-> ... no?
-We tried rserv, PG Link (Joseph Conway), and PostgreSQL Replicator. All
-these projects are trigger based asynchronous replication. They all have
-some advantages over the current functionality of Postgres-R some of
-which I believe can be addressed:
-1) Partial replication - being able to replicate just one or part of a
-2) They make no changes to the PostgreSQL code base. (Postgres-R can't
-address this one ;)
-3) PostgreSQL Replicator has some very nice conflict resolution schemes.
-Here are some disadvantages to using a "trigger based" approach:
-1) Triggers simply transfer individual data items when they are modified,
-they do not keep track of transactions.
-2) The execution of triggers within a database imposes a performance
-overhead to that database.
-3) Triggers require careful management by database administrators.
-Someone needs to keep track of all the "alarms" going off.
-4) The activation of triggers in a database cannot be easily
-rolled back or undone.
-> On Tue, 12 Jun 2001, Zeugswetter Andreas SB wrote:
-> > Doing a replicate all or nothing approach that only works synchronous
-> > is imho not flexible enough.
-> >
-I agree. Partial and asynchronous replication need to be addressed,
-and some of the common functionality of Postgres-R could possibly
-be used to meet those needs.
-Thanks for your feedback,
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M9969@postgresql.org Tue Jun 12 16:53:45 2001
-Return-path: <pgsql-hackers-owner+M9969@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CKriL23104
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 16:53:44 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CKrlE87423;
- Tue, 12 Jun 2001 16:53:47 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9969@postgresql.org)
-Received: from sectorbase2.sectorbase.com (sectorbase2.sectorbase.com [] (may be forged))
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CHWkE69562
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 13:32:46 -0400 (EDT)
- (envelope-from vmikheev@SECTORBASE.COM)
-Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
- id <MX6MWMV8>; Tue, 12 Jun 2001 10:30:29 -0700
-Message-ID: <3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com>
-From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
-To: "'Darren Johnson'" <djohnson@greatbridge.com>,
- The Hermit Hacker
- <scrappy@hub.org>
-cc: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>,
- pgsql-hackers@postgresql.org
-Subject: RE: AW: [HACKERS] Postgres Replication
-Date: Tue, 12 Jun 2001 10:30:27 -0700
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2653.19)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Here are some disadvantages to using a "trigger based" approach:
-> 1) Triggers simply transfer individual data items when they
-> are modified, they do not keep track of transactions.
-I don't know about other *async* replication engines but Rserv
-keeps track of transactions (if I understood you correctly).
-Rserv transfers not individual modified data items but
-*consistent* snapshot of changes to move slave database from
-one *consistent* state (when all RI constraints satisfied)
-to another *consistent* state.
-> 4) The activation of triggers in a database cannot be easily
-> rolled back or undone.
-What do you mean?
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M9967@postgresql.org Tue Jun 12 16:42:11 2001
-Return-path: <pgsql-hackers-owner+M9967@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CKgBL17982
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 16:42:11 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CKgDE80566;
- Tue, 12 Jun 2001 16:42:13 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9967@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CIVdE07561
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 14:31:39 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CIUfQ10080;
- Tue, 12 Jun 2001 14:30:41 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-Date: Tue, 12 Jun 2001 18:29:20 GMT
-Message-ID: <20010612.18292000@j2.us.greatbridge.com>
-Subject: RE: AW: [HACKERS] Postgres Replication
-To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
-cc: The Hermit Hacker <scrappy@hub.org>,
- Zeugswetter Andreas SB
- <ZeugswetterA@wien.spardat.at>,
- pgsql-hackers@postgresql.org
-Reply-To: Darren Johnson <djohnson@greatbridge.com>
- <3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com>
-References: <3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CIVdE07562
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> > Here are some disadvantages to using a "trigger based" approach:
-> >
-> > 1) Triggers simply transfer individual data items when they
-> > are modified, they do not keep track of transactions.
-> I don't know about other *async* replication engines but Rserv
-> keeps track of transactions (if I understood you correctly).
-> Rserv transfers not individual modified data items but
-> *consistent* snapshot of changes to move slave database from
-> one *consistent* state (when all RI constraints satisfied)
-> to another *consistent* state.
-I thought Andreas did a good job of correcting me here. Points 1 and 4
-do not apply to transaction-based replication with triggers. I
-should have made a distinction between non-transaction-based and
-transaction-based replication with triggers. I was not trying to
-single out rserv or any other project, and I can see how my wording
-invites that misinterpretation (my apologies).
-> > 4) The activation of triggers in a database cannot be easily
-> > rolled back or undone.
-> What do you mean?
-Once the trigger fires, it is not an easy task to abort that
-execution via rollback or undo. Again this is not an issue
-with a transaction-based trigger approach.
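The distinction drawn in this exchange, shipping individual row changes versus shipping whole transactions, can be sketched as follows. This is a minimal illustration, not Rserv or Postgres-R code: the change-log tuple layout, the `committed` set, and both function names are assumptions made for the example.

```python
from collections import OrderedDict


def group_by_transaction(change_log):
    """Group trigger-captured changes by the transaction that made them.

    change_log: iterable of (xid, table, key, value) tuples in capture order
    (a hypothetical layout). Returns an OrderedDict of xid -> list of changes.
    """
    batches = OrderedDict()
    for xid, table, key, value in change_log:
        batches.setdefault(xid, []).append((table, key, value))
    return batches


def apply_batches(replica, batches, committed):
    """Apply only committed transactions, each one as a single unit."""
    for xid, changes in batches.items():
        if xid not in committed:
            continue  # rolled-back work never reaches the replica
        # Stage the whole transaction on a copy, then swap it in at once.
        staged = {table: dict(rows) for table, rows in replica.items()}
        for table, key, value in changes:
            staged.setdefault(table, {})[key] = value
        replica.clear()
        replica.update(staged)
    return replica


# Transaction 1 touches two tables and commits; transaction 2 is rolled back.
change_log = [(1, "pr", 5, "a"), (2, "pr", 6, "b"), (1, "fk", 5, "a-child")]
replica = apply_batches({}, group_by_transaction(change_log), committed={1})
```

Because both rows of transaction 1 arrive together, the replica never sees the child row without its parent, which is the "consistent state to consistent state" property described above.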
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M9943@postgresql.org Tue Jun 12 10:03:02 2001
-Return-path: <pgsql-hackers-owner+M9943@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CE32L04619
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 10:03:02 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CE31E70430;
- Tue, 12 Jun 2001 10:03:01 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9943@postgresql.org)
-Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CDoQE64062
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 09:50:26 -0400 (EDT)
- (envelope-from ZeugswetterA@wien.spardat.at)
-Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [])
- by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5CDoJe11224
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 15:50:19 +0200
-Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
- id <M3L15S4T>; Tue, 12 Jun 2001 15:50:15 +0200
-Message-ID: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at>
-From: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-To: "'Darren Johnson'" <djohnson@greatbridge.com>,
- The Hermit Hacker
- <scrappy@hub.org>
-cc: pgsql-hackers@postgresql.org
-Subject: AW: AW: [HACKERS] Postgres Replication
-Date: Tue, 12 Jun 2001 15:50:09 +0200
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2650.21)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Here are some disadvantages to using a "trigger based" approach:
-> 1) Triggers simply transfer individual data items when they
-> are modified, they do not keep track of transactions.
-> 2) The execution of triggers within a database imposes a performance
-> overhead to that database.
-> 3) Triggers require careful management by database administrators.
-> Someone needs to keep track of all the "alarms" going off.
-> 4) The activation of triggers in a database cannot be easily
-> rolled back or undone.
-Yes, points 2 and 3 are a given, although point 2 buys you the functionality
-of transparent locking across all involved db servers.
-Points 1 and 4 are only the case for a trigger mechanism that does
-not use remote connection and 2-phase commit.
-Imho an implementation that opens a separate client connection to the
-replication target is only suited for async replication, and for that a WAL
-based solution would probably impose less overhead.
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M9946@postgresql.org Tue Jun 12 10:47:09 2001
-Return-path: <pgsql-hackers-owner+M9946@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CEl9L08144
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 10:47:09 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CEihE88714;
- Tue, 12 Jun 2001 10:44:43 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9946@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CEd6E85859
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 10:39:06 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CEcgQ04905;
- Tue, 12 Jun 2001 10:38:42 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-Date: Tue, 12 Jun 2001 14:37:18 GMT
-Message-ID: <20010612.14371800@j2.us.greatbridge.com>
-Subject: Re: AW: AW: [HACKERS] Postgres Replication
-To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-cc: pgsql-hackers@postgresql.org
-Reply-To: Darren Johnson <djohnson@greatbridge.com>
- <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at>
-References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CEd6E85860
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Imho an implementation that opens a separate client connection to the
-> replication target is only suited for async replication, and for that a WAL
-> based solution would probably impose less overhead.
-Yes, there is significant overhead in opening a connection to a
-client, so Postgres-R creates a pool of backends at startup. Coupled
-with the group communication system (Ensemble), this significantly
-reduces the issue.
-Very good points,
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-From pgsql-hackers-owner+M9982@postgresql.org Tue Jun 12 19:04:06 2001
-Return-path: <pgsql-hackers-owner+M9982@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CN46E10043
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 19:04:06 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CN4AE62160;
- Tue, 12 Jun 2001 19:04:10 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9982@postgresql.org)
-Received: from spoetnik.xs4all.nl (spoetnik.xs4all.nl [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CMxaE60194
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 18:59:36 -0400 (EDT)
- (envelope-from reinoud@xs4all.nl)
-Received: from KAYAK (kayak [])
- by spoetnik.xs4all.nl (Postfix) with SMTP id 435353E1B
- for <pgsql-hackers@postgresql.org>; Wed, 13 Jun 2001 00:59:28 +0200 (CEST)
-From: reinoud@xs4all.nl (Reinoud van Leeuwen)
-To: pgsql-hackers@postgresql.org
-Subject: Re: AW: AW: [HACKERS] Postgres Replication
-Date: Tue, 12 Jun 2001 22:59:23 GMT
-Organization: Not organized in any way
-Reply-To: reinoud@xs4all.nl
-Message-ID: <3b499c5b.652202125@>
-References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at>
-In-Reply-To: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at>
-X-Mailer: Forte Agent 1.5/32.451
-MIME-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CMxcE60196
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Tue, 12 Jun 2001 15:50:09 +0200, you wrote:
->> Here are some disadvantages to using a "trigger based" approach:
->> 1) Triggers simply transfer individual data items when they
->> are modified, they do not keep track of transactions.
->> 2) The execution of triggers within a database imposes a performance
->> overhead to that database.
->> 3) Triggers require careful management by database administrators.
->> Someone needs to keep track of all the "alarms" going off.
->> 4) The activation of triggers in a database cannot be easily
->> rolled back or undone.
->Yes, points 2 and 3 are a given, although point 2 buys you the functionality
->of transparent locking across all involved db servers.
->Points 1 and 4 are only the case for a trigger mechanism that does
->not use remote connection and 2-phase commit.
->Imho an implementation that opens a separate client connection to the
->replication target is only suited for async replication, and for that a WAL
->based solution would probably impose less overhead.
-Well as I read back the thread I see 2 different approaches to
-replication:
-1: tight integrated replication.
-pro:
-- bi-directional (or multidirectional): updates are possible everywhere
-- A cluster of servers always has the same state.
-- it does not matter to which server you connect
-con:
-- network between servers will be a bottleneck, especially if it is a
-WAN connection
-- only full replication possible
-- what happens if one server is down? (or the network between) are
-commits still possible
-2: async replication
-pro:
-- long distance possible
-- no problems with network outages
-- only changes are replicated, selects do not have impact
-- no locking issues across servers
-- partial replication possible (many->one (datawarehouse), or one->many
-(queries possible everywhere, updates only central))
-- good for failover situations (backup server is standing by)
-con:
-- bidirectional replication hard to set up (you'll have to implement
-conflict resolution according to your business rules)
-- different servers are not guaranteed to be in the same state.
-I can think of some scenarios where I would definitely want to
-*choose* one of the options. A load-balanced web environment would
-likely want the first option, but synchronizing offices in different
-continents might not work with 2-phase commit over the network....
-And we have not even started talking about *managing* replicated
-environments. A lot of fail-over scenarios stop planning after the
-backup host has taken control. But how to get back?
-"Nothing is as subjective as reality"
-Reinoud van Leeuwen reinoud@xs4all.nl
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M9986@postgresql.org Tue Jun 12 19:48:48 2001
-Return-path: <pgsql-hackers-owner+M9986@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CNmmE13125
- for <pgman@candle.pha.pa.us>; Tue, 12 Jun 2001 19:48:48 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CNmqE76673;
- Tue, 12 Jun 2001 19:48:52 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9986@postgresql.org)
-Received: from sss.pgh.pa.us ([])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CNdQE73923
- for <pgsql-hackers@postgresql.org>; Tue, 12 Jun 2001 19:39:26 -0400 (EDT)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f5CNdI016442;
- Tue, 12 Jun 2001 19:39:18 -0400 (EDT)
-To: reinoud@xs4all.nl
-cc: pgsql-hackers@postgresql.org
-Subject: Re: AW: AW: [HACKERS] Postgres Replication
-In-Reply-To: <3b499c5b.652202125@>
-References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> <3b499c5b.652202125@>
-Comments: In-reply-to reinoud@xs4all.nl (Reinoud van Leeuwen)
- message dated "Tue, 12 Jun 2001 22:59:23 +0000"
-Date: Tue, 12 Jun 2001 19:39:18 -0400
-Message-ID: <16439.992389158@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-reinoud@xs4all.nl (Reinoud van Leeuwen) writes:
-> Well as I read back the thread I see 2 different approaches to
-> replication:
-> ...
-> I can think of some scenarios where I would definitely want to
-> *choose* one of the options.
-Yes. IIRC, it looks to be possible to support a form of async
-replication using the Postgres-R approach: you allow the cluster
-to break apart when communications fail, and then rejoin when
-your link comes back to life. (This can work in principle, how
-close it is to reality is another question; but the rejoin operation
-is the same as crash recovery, so you have to have it anyway.)
-So this seems to me to allow getting most of the benefits of the async
-approach. OTOH it is difficult to see how to go the other way: getting
-the benefits of a synchronous solution atop a basically-async
-implementation doesn't seem like it can work.
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-From pgsql-hackers-owner+M9997@postgresql.org Wed Jun 13 09:05:56 2001
-Return-path: <pgsql-hackers-owner+M9997@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5DD5tE28260
- for <pgman@candle.pha.pa.us>; Wed, 13 Jun 2001 09:05:55 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5DD5xE12437;
- Wed, 13 Jun 2001 09:05:59 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9997@postgresql.org)
-Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5DD19E00635
- for <pgsql-hackers@postgresql.org>; Wed, 13 Jun 2001 09:01:10 -0400 (EDT)
- (envelope-from ZeugswetterA@wien.spardat.at)
-Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [])
- by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5DD13m08153
- for <pgsql-hackers@postgresql.org>; Wed, 13 Jun 2001 15:01:03 +0200
-Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21)
- id <M6AB97MY>; Wed, 13 Jun 2001 15:00:02 +0200
-Message-ID: <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at>
-From: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-To: "'reinoud@xs4all.nl'" <reinoud@xs4all.nl>, pgsql-hackers@postgresql.org
-Subject: AW: AW: AW: [HACKERS] Postgres Replication
-Date: Wed, 13 Jun 2001 11:55:48 +0200
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2650.21)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Well as I read back the thread I see 2 different approaches to
-> replication:
-> 1: tight integrated replication.
-> pro:
-> - bi-directional (or multidirectional): updates are possible everywhere
-> - A cluster of servers always has the same state.
-> - it does not matter to which server you connect
-> con:
-> - network between servers will be a bottleneck, especially if it is a
-> WAN connection
-> - only full replication possible
-I do not understand that point. If it is trigger based, you
-have all the flexibility you need (only some tables, only some rows,
-different rows to different targets ....).
-(Or do you mean not all targets? That could also be achieved with triggers.)
-> - what happens if one server is down? (or the network between) are
-> commits still possible
-No, updates are not possible if one target is not reachable,
-that would not be synchronous and would again need business rules
-to resolve conflicts.
-Allowing updates when a target is not reachable would require admin
-intervention.
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M10005@postgresql.org Wed Jun 13 11:15:48 2001
-Return-path: <pgsql-hackers-owner+M10005@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5DFFmE08382
- for <pgman@candle.pha.pa.us>; Wed, 13 Jun 2001 11:15:48 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5DFFoE53621;
- Wed, 13 Jun 2001 11:15:50 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M10005@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5DEk7E38930
- for <pgsql-hackers@postgresql.org>; Wed, 13 Jun 2001 10:46:07 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5DEhfQ22566;
- Wed, 13 Jun 2001 10:43:41 -0400
-From: Darren Johnson <djohnson@greatbridge.com>
-Date: Wed, 13 Jun 2001 14:44:11 GMT
-Message-ID: <20010613.14441100@j2.us.greatbridge.com>
-Subject: Re: AW: AW: AW: [HACKERS] Postgres Replication
-To: Zeugswetter Andreas SB <ZeugswetterA@wien.spardat.at>
-cc: "'reinoud@xs4all.nl'" <reinoud@xs4all.nl>, pgsql-hackers@postgresql.org
-Reply-To: Darren Johnson <djohnson@greatbridge.com>
- <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at>
-References: <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5DEk8E38931
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> > - only full replication possible
-> I do not understand that point, if it is trigger based, you
-> have all the flexibility you need. (only some tables, only some rows,
-> different rows to different targets ....),
-> (or do you mean not all targets, that could also be achieved with
-> triggers)
-Currently with Postgres-R, it is one database replicating all tables to
-all servers in the group communication system. There are some ways around
-this by invoking the -r option when a SQL statement should be replicated,
-and leaving the -r option off for non-replicated scenarios. IMHO this is
-not a good solution.
-A better solution will need to be implemented, which involves a
-subscription table(s) with relation/server information. There are two
-ideas for subscribing and receiving replicated data.
-1) Receiver driven propagation - A simple solution where all
-transactions are propagated and the receiving servers will reference
-the subscription information before applying updates.
-2) Sender driven propagation - A more optimal and complex solution
-where servers do not receive any messages regarding data items for
-which they have not subscribed.
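The two subscription ideas listed above differ only in where the filtering happens. A rough sketch, assuming a hypothetical in-memory subscription table (`SUBSCRIPTIONS`) and made-up server and relation names; it does not reflect actual Postgres-R code:

```python
# Hypothetical subscription table: server name -> relations it subscribes to.
SUBSCRIPTIONS = {
    "server_a": {"orders", "customers"},
    "server_b": {"orders"},
}


def receiver_driven(txn, servers):
    """Broadcast every change; each receiver filters before applying."""
    delivered = {s: list(txn) for s in servers}  # everything goes on the wire
    applied = {
        s: [c for c in msgs if c[0] in SUBSCRIPTIONS.get(s, set())]
        for s, msgs in delivered.items()
    }
    return delivered, applied


def sender_driven(txn, servers):
    """Sender consults the subscription table and sends only what is wanted."""
    delivered = {}
    for s in servers:
        wanted = [c for c in txn if c[0] in SUBSCRIPTIONS.get(s, set())]
        if wanted:
            delivered[s] = wanted
    return delivered


# Each change is (relation, key, value).
txn = [("orders", 1, "new"), ("customers", 7, "upd")]
rd_delivered, rd_applied = receiver_driven(txn, ["server_a", "server_b"])
sd_delivered = sender_driven(txn, ["server_a", "server_b"])
```

Both variants end with the same rows applied on each server; the sender-driven one simply puts less traffic on the network at the cost of the sender having to know every subscription.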
-> > - what happens if one server is down? (or the network between) are
-> > commits still possible
-> No, updates are not possible if one target is not reachable,
-AFAIK, Postgres-R can still replicate if one target is not reachable,
-but only to the remaining servers ;).
-There is a scenario that could arise if a server issues a lock
-request then fails or goes off line. There is code that checks
-for this condition, which needs to be merged with the branch we have.
-> that would not be synchronous and would again need business rules
-> to resolve conflicts.
-Yes the failed server would not be synchronized, and getting this
-failed server back in sync needs to be addressed.
-> Allowing updates when a target is not reachable would require admin
-> intervention.
-In its current state yes, but our goal would be to eliminate this
-requirement as well.
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M18443=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 19:16:17 2002
-Return-path: <pgsql-hackers-owner+M18443=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g150GGP03822
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 19:16:16 -0500 (EST)
-Received: (qmail 77444 invoked by alias); 5 Feb 2002 00:16:11 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 00:16:11 -0000
-Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g150Esl77040
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 19:14:54 -0500 (EST)
- (envelope-from markw@mohawksoft.com)
-Received: from mohawksoft.com (localhost [])
- by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g150AWh08676
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 19:10:33 -0500
-Message-ID: <3C5F22F8.C9B958F0@mohawksoft.com>
-Date: Mon, 04 Feb 2002 19:10:32 -0500
-From: mlw <markw@mohawksoft.com>
-X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: [HACKERS] Replication
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I rewrote RServ.pm in C, and wrote a replication daemon. It works, but it
-works like the whole rserv project. I don't like it.
-OK, what the hell do we need to do to get PostgreSQL replicating?
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M18445=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 19:57:01 2002
-Return-path: <pgsql-hackers-owner+M18445=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g150v0P06518
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 19:57:00 -0500 (EST)
-Received: (qmail 90440 invoked by alias); 5 Feb 2002 00:56:59 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 00:56:59 -0000
-Received: from www1.navtechinc.com ([])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g150rMl89885
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 19:53:22 -0500 (EST)
- (envelope-from ssinger@navtechinc.com)
-Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [])
- by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id AAA06047;
- Tue, 5 Feb 2002 00:53:22 GMT
-Received: from localhost (ssinger@localhost)
- by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id AAA10675;
- Tue, 5 Feb 2002 00:52:43 GMT
-Date: Tue, 5 Feb 2002 00:52:43 +0000 (GMT)
-From: Steven <ssinger@navtechinc.com>
-X-X-Sender: <ssinger@pcNavYkfAdm1.ykf.navtechinc.com>
-To: mlw <markw@mohawksoft.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com>
-Message-ID: <Pine.LNX.4.33.0202050040190.24027-100000@pcNavYkfAdm1.ykf.navtechinc.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Mon, 4 Feb 2002, mlw wrote:
-I've developed a replacement for Rserv and we are planning on releasing
-it as open source (i.e. as a contrib module).
-Like Rserv it's trigger based, but it's much more flexible.
-The key advantages it has over Rserv are:
--Support for multiple slaves
--It preserves transactions while doing the mirroring, i.e. if rows A and B are
-originally added in the same transaction they will be mirrored in the same
-transaction.
-We have plans on adding filtering based on data/selective mirroring as
-well. (Ie only rows with COUNTRY='Canada' go to
-slave A, and rows with COUNTRY='China' go to slave B).
-But I'm not sure when I'll get to that.
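The selective mirroring described here (rows routed to a slave by a data predicate) might look roughly like the sketch below. The rule table, slave names, and `route_rows` function are illustrative assumptions and are not taken from the module being announced.

```python
def route_rows(rows, rules):
    """Return {slave: [rows]}; a row goes to every slave whose predicate matches."""
    routed = {slave: [] for slave in rules}
    for row in rows:
        for slave, predicate in rules.items():
            if predicate(row):
                routed[slave].append(row)
    return routed


# Hypothetical filter rules, one predicate per slave.
rules = {
    "slave_a": lambda r: r["COUNTRY"] == "Canada",
    "slave_b": lambda r: r["COUNTRY"] == "China",
}

rows = [
    {"id": 1, "COUNTRY": "Canada"},
    {"id": 2, "COUNTRY": "China"},
    {"id": 3, "COUNTRY": "Austria"},  # matches no rule, so it is mirrored nowhere
]
routed = route_rows(rows, rules)
```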
-Support for conflict resolution (if edits are allowed on the slaves)
-would be nice.
-I hope to be able to send a tarball with the source to the pgpatches list
-within the next few days.
-We've been using the system operationally for a number of months and have
-been happy with it.
-> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
-> works like the whole rserv project. I don't like it.
-> OK, what the hell do we need to do to get PostgreSQL replicating?
-Steven Singer ssinger@navtechinc.com
-Aircraft Performance Systems Phone: 519-747-1170 ext 282
-Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR
-Waterloo, Ontario ARINC: YKFNSCR
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M18447=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 20:06:57 2002
-Return-path: <pgsql-hackers-owner+M18447=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g1516vP07508
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 20:06:57 -0500 (EST)
-Received: (qmail 92753 invoked by alias); 5 Feb 2002 01:06:55 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 01:06:55 -0000
-Received: from inflicted.crimelabs.net (crimelabs.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g150vhl91978
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 19:57:44 -0500 (EST)
- (envelope-from bpalmer@crimelabs.net)
-Received: from mizer.crimelabs.net (mizer.crimelabs.net [])
- by inflicted.crimelabs.net (Postfix) with ESMTP
- id 9D6EE8779; Mon, 4 Feb 2002 19:57:46 -0500 (EST)
-Date: Mon, 4 Feb 2002 19:57:34 -0500 (EST)
-From: bpalmer <bpalmer@crimelabs.net>
-To: mlw <markw@mohawksoft.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com>
-Message-ID: <Pine.BSO.4.43.0202041955420.17121-100000@mizer.crimelabs.net>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> OK, what the hell do we need to do to get PostgreSQL replicating?
-I hope you understand that replication, done right, is a massive
-project. I know that Darren and myself (and the rest of the pg-repl
-folks) have been waiting till 7.2 went gold before doing any more work. I
-think we hope to have master / slave replication working for 7.3 and then
-target multimaster for 7.4. At least that's the hope.
-- Brandon
- c: 646-456-5455 h: 201-798-4983
- b. palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M18449=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 21:16:56 2002
-Return-path: <pgsql-hackers-owner+M18449=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g152GtP10503
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 21:16:55 -0500 (EST)
-Received: (qmail 6711 invoked by alias); 5 Feb 2002 02:16:53 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 02:16:53 -0000
-Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g151qSl99469
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 20:52:28 -0500 (EST)
- (envelope-from markw@mohawksoft.com)
-Received: from mohawksoft.com (localhost [])
- by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g151lph09147;
- Mon, 4 Feb 2002 20:47:51 -0500
-Message-ID: <3C5F39C7.970F4549@mohawksoft.com>
-Date: Mon, 04 Feb 2002 20:47:51 -0500
-From: mlw <markw@mohawksoft.com>
-X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Steven <ssinger@navtechinc.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-References: <Pine.LNX.4.33.0202050040190.24027-100000@pcNavYkfAdm1.ykf.navtechinc.com>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-Steven wrote:
-> On Mon, 4 Feb 2002, mlw wrote:
-> I've developed a replacement for Rserv and we are planning on releasing
-> it as open source (i.e. as a contrib module).
-> Like Rserv it's trigger based, but it's much more flexible.
-> The key advantages it has over Rserv are that it has
-> -Support for multiple slaves
-> -It preserves transactions while doing the mirroring, i.e. if rows A and B are
-> originally added in the same transaction they will be mirrored in the same
-> transaction.
-I did a similar thing. I took the rserv trigger "as is," but rewrote the
-replication support code. What I eventually did was write a "snapshot daemon"
-which created snapshot files. Then a "slave daemon" which would check the last
-snapshot applied and apply all the snapshots, in order, as needed. One would
-run one of these daemons per slave server.
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M18448=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 20:57:25 2002
-Return-path: <pgsql-hackers-owner+M18448=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g151vOP09239
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 20:57:24 -0500 (EST)
-Received: (qmail 99828 invoked by alias); 5 Feb 2002 01:57:19 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 01:57:19 -0000
-Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g151s0l99529
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 20:54:00 -0500 (EST)
- (envelope-from markw@mohawksoft.com)
-Received: from mohawksoft.com (localhost [])
- by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g151nah09156;
- Mon, 4 Feb 2002 20:49:37 -0500
-Message-ID: <3C5F3A30.A4C46FB8@mohawksoft.com>
-Date: Mon, 04 Feb 2002 20:49:36 -0500
-From: mlw <markw@mohawksoft.com>
-X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: bpalmer <bpalmer@crimelabs.net>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-References: <Pine.BSO.4.43.0202041955420.17121-100000@mizer.crimelabs.net>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-bpalmer wrote:
-> >
-> > OK, what the hell do we need to do to get PostgreSQL replicating?
-> I hope you understand that replication, done right, is a massive
-> project. I know that Darren and myself (and the rest of the pg-repl
-> folks) have been waiting till 7.2 went gold till we did any more work. I
-> think we hope to have master / slave replication working for 7.3 and then
-> target multimaster for 7.4. At least that's the hope.
-I do know how hard replication is. I also understand how important it is.
-If you guys have a project going, and need developers, I am more than willing.
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M18450=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 21:42:13 2002
-Return-path: <pgsql-hackers-owner+M18450=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g152gCP11957
- for <pgman@candle.pha.pa.us>; Mon, 4 Feb 2002 21:42:13 -0500 (EST)
-Received: (qmail 14229 invoked by alias); 5 Feb 2002 02:42:09 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 5 Feb 2002 02:42:09 -0000
-Received: from www1.navtechinc.com ([])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g152SBl10682
- for <pgsql-hackers@postgresql.org>; Mon, 4 Feb 2002 21:28:11 -0500 (EST)
- (envelope-from ssinger@navtechinc.com)
-Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [])
- by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id CAA06384;
- Tue, 5 Feb 2002 02:28:13 GMT
-Received: from localhost (ssinger@localhost)
- by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id CAA10682;
- Tue, 5 Feb 2002 02:27:35 GMT
-Date: Tue, 5 Feb 2002 02:27:35 +0000 (GMT)
-From: Steven <ssinger@navtechinc.com>
-X-X-Sender: <ssinger@pcNavYkfAdm1.ykf.navtechinc.com>
-To: mlw <markw@mohawksoft.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F39C7.970F4549@mohawksoft.com>
-Message-ID: <Pine.LNX.4.33.0202050159591.26756-100000@pcNavYkfAdm1.ykf.navtechinc.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-DBMirror doesn't use snapshots; instead it records a log of transactions
-that are committed to the database in a pair of tables.
-In the case of an INSERT this is the row that is being added.
-In the case of a DELETE it is the primary key of the row being deleted.
-And in the case of an UPDATE, it is the primary key before the update along with
-all of the data the row should have after the update.
-Then for each slave database a perl script walks through the transactions
-that are pending for that host and reconstructs SQL to send the row edits
-to that host. A record of the fact that transaction Y has been sent to
-host X is also kept.
-When a transaction has been sent to all of the hosts that are in the
-system it is then deleted from the Pending tables.
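The pending-log scheme described above can be sketched roughly as follows. This is a minimal in-memory illustration, not DBMirror's actual code: the names `PendingChange`, `PendingLog`, `to_sql`, `pending_for`, and `mark_sent` are hypothetical, and the real system keeps this information in database tables and uses a perl script to send the SQL.

```python
from dataclasses import dataclass


@dataclass
class PendingChange:
    """One logged row edit (hypothetical structure, for illustration only)."""
    txn_id: int
    op: str      # "INSERT", "UPDATE" or "DELETE"
    table: str
    key: dict    # primary key before the change (used by UPDATE and DELETE)
    row: dict    # full row after the change (used by INSERT and UPDATE)


def quote(value):
    """Render a Python value as a simple SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def where(key):
    return " AND ".join(f"{col} = {quote(val)}" for col, val in key.items())


def to_sql(change):
    """Reconstruct the SQL statement that replays one logged change."""
    if change.op == "INSERT":
        cols = ", ".join(change.row)
        vals = ", ".join(quote(v) for v in change.row.values())
        return f"INSERT INTO {change.table} ({cols}) VALUES ({vals})"
    if change.op == "UPDATE":
        sets = ", ".join(f"{c} = {quote(v)}" for c, v in change.row.items())
        return f"UPDATE {change.table} SET {sets} WHERE {where(change.key)}"
    if change.op == "DELETE":
        return f"DELETE FROM {change.table} WHERE {where(change.key)}"
    raise ValueError(f"unknown operation: {change.op}")


class PendingLog:
    """Tracks which transactions still have to be sent to which host."""

    def __init__(self, hosts):
        self.hosts = set(hosts)
        self.changes = []   # logged changes, in commit order
        self.sent = {}      # txn_id -> set of hosts that already received it

    def record(self, change):
        self.changes.append(change)
        self.sent.setdefault(change.txn_id, set())

    def pending_for(self, host):
        """Return [(txn_id, [sql, ...]), ...] so that rows changed in one
        transaction are replayed together in one transaction."""
        batches = {}
        for change in self.changes:
            if host not in self.sent[change.txn_id]:
                batches.setdefault(change.txn_id, []).append(to_sql(change))
        return list(batches.items())

    def mark_sent(self, txn_id, host):
        """Record delivery; drop the transaction once every host has it."""
        self.sent[txn_id].add(host)
        if self.sent[txn_id] == self.hosts:
            self.changes = [c for c in self.changes if c.txn_id != txn_id]
            del self.sent[txn_id]
```

Grouping the statements by transaction id is what lets a slave replay rows A and B inside the same transaction they were originally written in.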
-I suspect that all of the information I'm storing in the Pending tables is
-also being stored by Postgres in its log but I haven't investigated how
-the information could be extracted (or how long it is kept for). That
-would reduce the extra storage overhead that the replication system
-As I remember (it's been a while since I've looked at it), RServ uses OIDs
-in its tables to point to the data that needs to be replicated. We tried
-a similar approach but found difficulties with doing partial updates.
-On Mon, 4 Feb 2002, mlw wrote:
-> I did a similar thing. I took the rserv trigger "as is," but rewrote the
-> replication support code. What I eventually did was write a "snapshot daemon"
-> which created snapshot files. Then a "slave daemon" which would check the last
-> snapshot applied and apply all the snapshots, in order, as needed. One would
-> run one of these daemons per slave server.
-Steven Singer ssinger@navtechinc.com
-Aircraft Performance Systems Phone: 519-747-1170 ext 282
-Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR
-Waterloo, Ontario ARINC: YKFNSCR
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M18554=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 02:49:48 2002
-Return-path: <pgsql-hackers-owner+M18554=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g177nlP04347
- for <pgman@candle.pha.pa.us>; Thu, 7 Feb 2002 02:49:47 -0500 (EST)
-Received: (qmail 22556 invoked by alias); 7 Feb 2002 07:49:49 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 7 Feb 2002 07:49:49 -0000
-Received: from linuxworld.com.au (www.linuxworld.com.au [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g177QfE19572
- for <pgsql-hackers@postgresql.org>; Thu, 7 Feb 2002 02:26:42 -0500 (EST)
- (envelope-from swm@linuxworld.com.au)
-Received: from localhost (swm@localhost)
- by linuxworld.com.au (8.11.4/8.11.4) with ESMTP id g177RiU06086;
- Thu, 7 Feb 2002 18:27:45 +1100
-Date: Thu, 7 Feb 2002 18:27:44 +1100 (EST)
-From: Gavin Sherry <swm@linuxworld.com.au>
-To: mlw <markw@mohawksoft.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com>
-Message-ID: <Pine.LNX.4.21.0202071751240.5160-100000@linuxworld.com.au>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-On Mon, 4 Feb 2002, mlw wrote:
-> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
-> works like the whole rserv project. I don't like it.
-> OK, what the hell do we need to do to get PostgreSQL replicating?
-The trigger model is not a very sophisticated one. I think I have a better
--- though more complicated -- one. This model would be able to handle
-multiple masters and master->slave.
-First of all, all machines in the cluster would have to be aware of all the
-machines in the cluster. This would have to be stored in a new system
-table.
-The FE/BE protocol would need to be modified to accept parsed node trees
-generated by pg_analyze_and_rewrite(). These could then be dispatched by
-the executing server, inside of pg_exec_query_string, to all other servers
-in the cluster (excluding itself). Naturally, this dispatch would need to
-be non-blocking.
-pg_exec_query_string() would need to check the nodetags to make sure
-selects and perhaps some commands are not dispatched.
-Before the executing server runs finish_xact_command(), it would check
-that the query was successfully executed on all machines, and otherwise
-abort. Such a system would need a few configuration options: whether or
-not you abort on failed replication to slaves, the ability to replicate
-only certain tables, etc.
-Naturally, this would slow down writes to the system (possibly a lot
-depending on the performance difference between the executing machine and
-the least powerful machine in the cluster), but most usages of postgresql
-are read intensive, not write.
-Any reason this model would not work?
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M18558=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 08:31:00 2002
-Return-path: <pgsql-hackers-owner+M18558=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17DUxP13923
- for <pgman@candle.pha.pa.us>; Thu, 7 Feb 2002 08:30:59 -0500 (EST)
-Received: (qmail 91796 invoked by alias); 7 Feb 2002 13:30:55 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 7 Feb 2002 13:30:55 -0000
-Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g17Cw0E87782
- for <pgsql-hackers@postgresql.org>; Thu, 7 Feb 2002 07:58:01 -0500 (EST)
- (envelope-from markw@mohawksoft.com)
-Received: from mohawksoft.com (localhost [])
- by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g17CqNt16887;
- Thu, 7 Feb 2002 07:52:24 -0500
-Message-ID: <3C627887.CC9FF837@mohawksoft.com>
-Date: Thu, 07 Feb 2002 07:52:23 -0500
-From: mlw <markw@mohawksoft.com>
-X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Gavin Sherry <swm@linuxworld.com.au>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-References: <Pine.LNX.4.21.0202071751240.5160-100000@linuxworld.com.au>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-Gavin Sherry wrote:
-> Naturally, this would slow down writes to the system (possibly a lot
-> depending on the performance difference between the executing machine and
-> the least powerful machine in the cluster), but most usages of postgresql
-> are read intensive, not write.
-> Any reason this model would not work?
-What, then, is the purpose of replication to multiple masters?
-I can think of only two reasons why you want replication. (1) Redundancy: make
-sure that if one server dies, then another server has the same data and is used
-seamlessly. (2) Increased performance over one system.
-For reason (1) I submit that a server load balancer which sits on top of
-PostgreSQL, and executes writes on both servers while distributing reads, would
-be best. This is a HUGE project. The load balancer must know EXACTLY how the
-system is configured, which includes all functions and everything.
-For reason (2) your system would fail to provide the scalability that would be
-needed. If writes take a long time, but reads are fine, what is the difference
-between this and the trigger based replicator?
-I have in the back of my mind, an idea of patching into the WAL stuff, and
-using that mechanism to push changes out to the slaves.
-Where one machine is still the master, but no trigger stuff, just a WAL patch.
-Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-exactly, the idea hasn't completely formed yet.
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M18574=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 12:51:42 2002
-Return-path: <pgsql-hackers-owner+M18574=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17HpfP16661
- for <pgman@candle.pha.pa.us>; Thu, 7 Feb 2002 12:51:41 -0500 (EST)
-Received: (qmail 62955 invoked by alias); 7 Feb 2002 17:50:42 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 7 Feb 2002 17:50:42 -0000
-Received: from www1.navtechinc.com ([])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g17HnTE62256
- for <pgsql-hackers@postgresql.org>; Thu, 7 Feb 2002 12:49:29 -0500 (EST)
- (envelope-from ssinger@navtechinc.com)
-Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [])
- by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id RAA07908;
- Thu, 7 Feb 2002 17:49:31 GMT
-Received: from localhost (ssinger@localhost)
- by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id RAA05687;
- Thu, 7 Feb 2002 17:48:52 GMT
-Date: Thu, 7 Feb 2002 17:48:51 +0000 (GMT)
-From: Steven Singer <ssinger@navtechinc.com>
-X-X-Sender: <ssinger@pcNavYkfAdm1.ykf.navtechinc.com>
-To: Gavin Sherry <swm@linuxworld.com.au>
-cc: mlw <markw@mohawksoft.com>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <Pine.LNX.4.21.0202071751240.5160-100000@linuxworld.com.au>
-Message-ID: <Pine.LNX.4.33.0202071735360.6435-100000@pcNavYkfAdm1.ykf.navtechinc.com>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-What you describe sounds like a form of a two-stage commit protocol.
-If the command worked on two of the replicated databases but failed on a
-third then the executing server would have to be able to undo the command
-on the replicated databases as well as itself.
-The problems with two stage commit type approaches to replication are
-1) Speed as you mentioned. Write speed isn't a concern for some
-applications but it is very important in others.
-2) All of the databases must be able to communicate with each other at
-all times in order for any edits to work. If the servers are
-connected over some sort of WAN that periodically has short outages this
-is a problem. Also if you're using replication because you want to be able
-to take down one of the databases for short periods of time without
-bringing down the others you're in trouble.
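The all-or-nothing behaviour discussed above can be illustrated with a toy sketch. It is a deliberately simplified model of a two stage commit, with hypothetical `Participant` and `replicate` names; it ignores failures during the commit stage and is not taken from any PostgreSQL code.

```python
class Participant:
    """A toy replica that can stage a change and later commit or drop it."""

    def __init__(self, name, reachable=True, will_fail=False):
        self.name = name
        self.reachable = reachable   # False simulates a WAN outage
        self.will_fail = will_fail   # True simulates the command failing here
        self.staged = None
        self.committed = []

    def prepare(self, change):
        if not self.reachable:
            raise ConnectionError(self.name)
        if self.will_fail:
            return False
        self.staged = change
        return True

    def commit(self):
        self.committed.append(self.staged)
        self.staged = None

    def rollback(self):
        self.staged = None


def replicate(change, participants):
    """Stage 1: ask every participant to prepare. Stage 2: commit everywhere,
    or undo on everyone that had prepared if any participant failed."""
    prepared = []
    try:
        for p in participants:
            if not p.prepare(change):
                raise RuntimeError(f"{p.name} could not apply the change")
            prepared.append(p)
    except (ConnectionError, RuntimeError):
        for p in prepared:
            p.rollback()
        return False
    for p in participants:
        p.commit()
    return True
```

With three participants, taking any one of them offline makes `replicate` return `False` for every write, which is the availability problem raised in point 2.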
-btw: I posted the alternative to Rserv that I mentioned the other day to
-the pg-patches mailing list. If anyone is interested you should be able
-to grab it off the archives.
-On Thu, 7 Feb 2002, Gavin Sherry wrote:
-> First of all, all machines in the cluster would have to be aware of all the
-> machines in the cluster. This would have to be stored in a new system
-> table.
-> The FE/BE protocol would need to be modified to accept parsed node trees
-> generated by pg_analyze_and_rewrite(). These could then be dispatched by
-> the executing server, inside of pg_exec_query_string, to all other servers
-> in the cluster (excluding itself). Naturally, this dispatch would need to
-> be non-blocking.
-> pg_exec_query_string() would need to check the nodetags to make sure
-> selects and perhaps some commands are not dispatched.
-> Before the executing server runs finish_xact_command(), it would check
-> that the query was successfully executed on all machines, and otherwise
-> abort. Such a system would need a few configuration options: whether or
-> not you abort on failed replication to slaves, the ability to replicate
-> only certain tables, etc.
-> Naturally, this would slow down writes to the system (possibly a lot
-> depending on the performance difference between the executing machine and
-> the least powerful machine in the cluster), but most usages of postgresql
-> are read intensive, not write.
-> Any reason this model would not work?
-> Gavin
-> ---------------------------(end of broadcast)---------------------------
-> TIP 4: Don't 'kill -9' the postmaster
-Steven Singer ssinger@navtechinc.com
-Aircraft Performance Systems Phone: 519-747-1170 ext 282
-Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR
-Waterloo, Ontario ARINC: YKFNSCR
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M18590=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 17:50:42 2002
-Return-path: <pgsql-hackers-owner+M18590=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17MoeP27121
- for <pgman@candle.pha.pa.us>; Thu, 7 Feb 2002 17:50:40 -0500 (EST)
-Received: (qmail 39930 invoked by alias); 7 Feb 2002 22:50:17 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 7 Feb 2002 22:50:17 -0000
-Received: from odin.fts.net (wall.icgate.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g17Ma4E38041
- for <pgsql-hackers@postgresql.org>; Thu, 7 Feb 2002 17:36:04 -0500 (EST)
- (envelope-from fharvell@odin.fts.net)
-Received: from odin.fts.net (fharvell@localhost)
- by odin.fts.net (8.11.6/8.11.6) with ESMTP id g17MZhR17707;
- Thu, 7 Feb 2002 17:35:43 -0500
-Message-ID: <200202072235.g17MZhR17707@odin.fts.net>
-X-Mailer: exmh version 2.2 06/23/2000 with nmh-1.0.4
-From: F Harvell <fharvell@fts.net>
-To: mlw <markw@mohawksoft.com>
-cc: Gavin Sherry <swm@linuxworld.com.au>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: Message from mlw
- of "Thu, 07 Feb 2002 07:52:23 EST."
- <3C627887.CC9FF837@mohawksoft.com>
-MIME-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Date: Thu, 07 Feb 2002 17:35:43 -0500
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I'm not that familiar with the whole replication issue in PostgreSQL;
-however, I would be partial to replication that was based upon the
-playback of the (a?) journal file. (I believe that the WAL is a
-journal file.)
-By being based upon a journal file, it would be possible to accomplish
-two significant items. First, it would be possible to "restore" a
-database to an exact state just before a failure. Most commercial
-databases provide the ability to do this. Banks, etc. log the journal
-files directly to tape to provide a complete transaction history such
-that they can rebuild their database from any given snapshot. (Note
-that the journal file needs to be "editable" as a failure may be
-"delete from x" with a missing where clause.)
-This leads directly into the second advantage, the ability to have a
-replicated database operating anywhere, over any connection on any
-server. Speed of writes would not be a factor. In essence, as long
-as the replicated database had a snapshot of the database and then was
-provided with all journal files since the snapshot, it would be
-possible to build a current database. If the replicant got behind in
-the processing, it would catch up when things slowed down.
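The snapshot-plus-journal idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the journal is modelled as a Python list of `(op, key, value)` entries numbered from 1, and `Replica`, `apply_entry`, and `catch_up` are hypothetical names that do not correspond to PostgreSQL's WAL format or any existing API.

```python
def apply_entry(state, entry):
    """Apply one journal entry to a key/value state."""
    op, key, value = entry
    if op == "put":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)
    else:
        raise ValueError(f"unknown journal operation: {op}")


class Replica:
    """Starts from a snapshot and replays journal entries recorded after it."""

    def __init__(self, snapshot, snapshot_seq):
        self.state = dict(snapshot)
        self.applied_seq = snapshot_seq  # last journal entry already reflected

    def catch_up(self, journal, upto=None):
        """Replay entries applied_seq+1 .. upto (default: end of journal).
        Entry number n is stored at journal[n - 1]."""
        end = len(journal) if upto is None else min(upto, len(journal))
        for seq in range(self.applied_seq + 1, end + 1):
            apply_entry(self.state, journal[seq - 1])
            self.applied_seq = seq
```

Stopping the replay early with `upto` corresponds to restoring the database to the state just before a bad statement, while calling `catch_up` repeatedly lets a lagging replica converge whenever it has time to process the backlog.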
-In my opinion, the first advantage is in many ways most important.
-Replication becomes simply the restoration of the database in realtime
-on a second server. The "replication" task becomes the definition of
-a protocol for distributing the journal file. At least one major
-database vendor does replication (shadowing) in exactly this manner.
-Maybe I'm all wet and the journal file and journal playback already
-exist. If so, IMHO, basing replication off of this would be the
-right direction.
-On Thu, 07 Feb 2002 07:52:23 EST, mlw wrote:
-> I have in the back of my mind, an idea of patching into the WAL stuff, and
-> using that mechanism to push changes out to the slaves.
-> Where one machine is still the master, but no trigger stuff, just a WAL patch.
-> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-> exactly, the idea hasn't completely formed yet.
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
-From pgsql-hackers-owner+M18605=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 00:50:08 2002
-Return-path: <pgsql-hackers-owner+M18605=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g185o7P27878
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 00:50:07 -0500 (EST)
-Received: (qmail 17348 invoked by alias); 8 Feb 2002 05:50:03 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 05:50:03 -0000
-Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g185cTE15241
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 00:38:29 -0500 (EST)
- (envelope-from darren.johnson@cox.net)
-Received: from cox.net ([]) by lakemtao03.mgt.cox.net
- (InterMail vM. 201-253-122-122-105-20011231) with ESMTP
- id <20020208053833.YKTV6710.lakemtao03.mgt.cox.net@cox.net>
- for <pgsql-hackers@postgresql.org>;
- Fri, 8 Feb 2002 00:38:33 -0500
-Message-ID: <3C636232.6060206@cox.net>
-Date: Fri, 08 Feb 2002 00:29:22 -0500
-From: Darren Johnson <darren.johnson@cox.net>
-User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; m18) Gecko/20001108 Netscape6/6.0
-X-Accept-Language: en
-MIME-Version: 1.0
-To: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-References: <Pine.LNX.4.33.0202071735360.6435-100000@pcNavYkfAdm1.ykf.navtechinc.com>
-Content-Type: text/plain; charset=us-ascii; format=flowed
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
- >
- > The problems with two stage commit type approaches to replication are
-IMHO the biggest problem with two phase commit is that it doesn't scale.
-The more servers you add to the replica the slower it goes. Also there's
-the potential for deadlocks across server boundaries.
- >
- > 2) All of the databases must be able to communicate with each other at
- > all times in order for any edits to work. If the servers are
- > connected over some sort of WAN that periodically has short outages this
- > is a problem. Also if you're using replication because you want to be able
- > to take down one of the databases for short periods of time without
- > bringing down the others you're in trouble.
-All true for the two phase commit protocol. To have multi master replication,
-you must have all systems communicating, but you can use a multicast group
-communication system instead of 2PC. Using total order messaging, you can
-ensure all changes are delivered to all servers in the replica in the same
-order. This group communication system also allows failures to be detected
-while other servers in the replica continue processing.
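The effect of total order delivery can be shown with a small sketch. It assumes a single sequencer that stamps every change with a global sequence number, which is only one simple way to obtain a total order; real group communication systems may work quite differently, and the `Sequencer` and `Replica` names here are hypothetical.

```python
class Sequencer:
    """Assigns one global sequence number to every change."""

    def __init__(self):
        self.next_seq = 0

    def stamp(self, change):
        seq = self.next_seq
        self.next_seq += 1
        return (seq, change)


class Replica:
    """Applies changes strictly in sequence order, buffering early arrivals."""

    def __init__(self):
        self.state = {}
        self.next_seq = 0
        self.buffer = {}  # seq -> change received ahead of its turn

    def receive(self, message):
        seq, change = message
        self.buffer[seq] = change
        # Apply every change that is now contiguous with what was applied.
        while self.next_seq in self.buffer:
            key, value = self.buffer.pop(self.next_seq)
            self.state[key] = value
            self.next_seq += 1
```

Because every replica applies the same changes in the same order, two replicas that receive the messages in different network orders still end up with identical state, without waiting for acknowledgements from each other.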
-A few of us are working with this theory, and trying to integrate with
-7.2. There is a working model for 6.4, but it's very limited (inserts,
-updates, and deletes). We are currently hosted at
-http://gborg.postgresql.org/project/pgreplication/projdisplay.php
-But the site has been down the last 2 days. I've contacted the web
-master, but haven't seen any results yet. If anyone knows what's going on
-with gborg, I'd appreciate a status.
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M18617=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 06:20:44 2002
-Return-path: <pgsql-hackers-owner+M18617=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18BKhP06132
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 06:20:43 -0500 (EST)
-Received: (qmail 90815 invoked by alias); 8 Feb 2002 11:20:40 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 11:20:40 -0000
-Received: from laptop.kieser.demon.co.uk (kieser.demon.co.uk [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18B9ZE89589
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 06:09:36 -0500 (EST)
- (envelope-from brad@kieser.net)
-Received: from laptop.kieser.demon.co.uk (localhost.localdomain [])
- by laptop.kieser.demon.co.uk (Postfix) with SMTP
- id 598393A132; Fri, 8 Feb 2002 11:09:36 +0000 (GMT)
-From: Bradley Kieser <brad@kieser.net>
-Date: Fri, 08 Feb 2002 11:09:36 GMT
-Message-ID: <20020208.11093600@laptop.kieser.demon.co.uk>
-Subject: Re: [HACKERS] Replication
-To: Darren Johnson <darren.johnson@cox.net>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-In-Reply-To: <3C636232.6060206@cox.net>
-References: <Pine.LNX.4.33.0202071735360.6435-100000@pcNavYkfAdm1.ykf.navtechinc.com> <3C636232.6060206@cox.net>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id g18BJoF90352
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-Given that different replication strategies will probably be developed
-for PG, do you envisage DBAs being able to select the type of replication
-for their installation? I.e. replication being selectable rather like
-storage structures?
-Would be a killer bit of flexibility, given how enormous the impact of
-replication will be to corporate adoption of PG.
->>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<<
-On 2/8/02, 5:29:22 AM, Darren Johnson <darren.johnson@cox.net> wrote
-regarding Re: [HACKERS] Replication:
-> >
-> > The problems with two stage commit type approaches to replication are
-> IMHO the biggest problem with two phase commit is that it doesn't scale.
-> The more servers you add to the replica the slower it goes. Also there's
-> the potential for deadlocks across server boundaries.
-> >
-> > 2) All of the databases must be able to communicate with each other at
-> > all times in order for any edits to work. If the servers are
-> > connected over some sort of WAN that periodically has short outages this
-> > is a problem. Also if you're using replication because you want to be able
-> > to take down one of the databases for short periods of time without
-> > bringing down the others you're in trouble.
-> All true for the two phase commit protocol. To have multi master replication,
-> you must have all systems communicating, but you can use a multicast group
-> communication system instead of 2PC. Using total order messaging, you can
-> ensure all changes are delivered to all servers in the replica in the same
-> order. This group communication system also allows failures to be detected
-> while other servers in the replica continue processing.
-> A few of us are working with this theory, and trying to integrate with
-> 7.2. There is a working model for 6.4, but it's very limited (inserts,
-> updates, and deletes). We are currently hosted at
-> http://gborg.postgresql.org/project/pgreplication/projdisplay.php
-> But the site has been down the last 2 days. I've contacted the web
-> master, but haven't seen any results yet. If anyone knows what's going on
-> with gborg, I'd appreciate a status.
-> Darren
-> ---------------------------(end of broadcast)---------------------------
-> TIP 2: you can get off all lists at once with the unregister command
-> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M18642=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 12:40:36 2002
-Return-path: <pgsql-hackers-owner+M18642=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18HeZP08450
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 12:40:35 -0500 (EST)
-Received: (qmail 74089 invoked by alias); 8 Feb 2002 17:40:30 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 17:40:30 -0000
-Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18HbwE73437
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 12:37:58 -0500 (EST)
- (envelope-from darren.johnson@cox.net)
-Received: from cox.net ([]) by lakemtao03.mgt.cox.net
- (InterMail vM. 201-253-122-122-105-20011231) with ESMTP
- id <20020208173804.DKQS6710.lakemtao03.mgt.cox.net@cox.net>;
- Fri, 8 Feb 2002 12:38:04 -0500
-Message-ID: <3C63FB71.206@cox.net>
-Date: Fri, 08 Feb 2002 11:23:13 -0500
-From: Darren Johnson <darren.johnson@cox.net>
-User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; m18) Gecko/20010131 Netscape6/6.01
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Bradley Kieser <brad@kieser.net>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Replication
-References: <Pine.LNX.4.33.0202071735360.6435-100000@pcNavYkfAdm1.ykf.navtechinc.com> <3C636232.6060206@cox.net> <20020208.11093600@laptop.kieser.demon.co.uk>
-Content-Type: text/plain; charset=us-ascii; format=flowed
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> Given that different replication strategies will probably be developed
-> for PG, do you envisage DBAs being able to select the type of replication
-> for their installation? I.e. replication being selectable rather like
-> storage structures?
-I can't speak for other replication solutions, but we are using the --with-replication
-or -r parameter when starting postmaster. Some day I hope there will be
-parameters for master/slave, partial/full, and sync/async, but it will be
-some time before we cross those
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-From pgsql-hackers-owner+M18658=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 14:42:40 2002
-Return-path: <pgsql-hackers-owner+M18658=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18JgdP28166
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 14:42:39 -0500 (EST)
-Received: (qmail 18650 invoked by alias); 8 Feb 2002 19:42:39 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 19:42:39 -0000
-Received: from enigma.trueimpact.net (enigma.trueimpact.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18JYBE17341
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 14:34:11 -0500 (EST)
- (envelope-from rjonasz@trueimpact.com)
-Received: from nietzsche.trueimpact.net (unknown [])
- by enigma.trueimpact.net (Postfix) with ESMTP id A785066B04
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 14:33:28 -0500 (EST)
-Date: Fri, 8 Feb 2002 14:34:34 -0500 (EST)
-From: Randall Jonasz <rjonasz@trueimpact.com>
-X-X-Sender: <rjonasz@nietzsche.trueimpact.net>
-To: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C627887.CC9FF837@mohawksoft.com>
-Message-ID: <20020208142932.H6545-100000@nietzsche.trueimpact.net>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I've been looking into database replication theory lately and have found
-some interesting papers discussing various approaches. (Here's
-one paper that struck me as being very helpful,
-http://citeseer.nj.nec.com/460405.html ) So far I favour an
-eager replication system predicated on a read-local, write-all-available
-policy. The system should not depend on two phase commit or primary
-copy algorithms. The former leads to the whole system being as quick as
-the slowest machine. In addition, 2 phase commit involves 2n messages for
-each transaction which does not scale well at all. This idea will also
-have to take into account a crashed node which did not ack a transaction.
-The primary copy algorithms I've seen suffer from a single point of
-failure and potential bottlenecks at the primary node.
-Instead I like the master to master or peer to peer algorithm as discussed
-in the above paper. This approach accounts for network partitions, nodes
-leaving and joining a cluster and the ability to commit a transaction once
-the communication module has determined the total order of the said
-transaction, i.e. no need for waiting for acks. This scales well and
-research has shown it to increase the number of transactions/second a
-database cluster can handle over a single node.
-Postgres-R is another interesting approach which I think should be taken
-seriously. Anyone interested can read a paper on this at
-http://citeseer.nj.nec.com/330257.html
-Anyways, my two cents
-Randall Jonasz
-Software Engineer
-Click2net Inc.
-On Thu, 7 Feb 2002, mlw wrote:
-> Gavin Sherry wrote:
-> > Naturally, this would slow down writes to the system (possibly a lot
-> > depending on the performance difference between the executing machine and
-> > the least powerful machine in the cluster), but most usages of postgresql
-> > are read intensive, not write.
-> >
-> > Any reason this model would not work?
-> What, then is the purpose of replication to multiple masters?
-> I can think of only two reasons why you want replication. (1) Redundancy, make
-> sure that if one server dies, then another server has the same data and is used
-> seamlessly. (2) Increase performance over one system.
-> In reason (1) I submit that a server load balance which sits on top of
-> PostgreSQL, and executes writes on both servers while distributing reads would
-> be best. This is a HUGE project. The load balancer must know EXACTLY how the
-> system is configured, which includes all functions and everything.
-> In reason (2) your system would fail to provide the scalability that would be
-> needed. If writes take a long time, but reads are fine, what is the difference
-> between this and the trigger-based replicator?
-> I have in the back of my mind, an idea of patching into the WAL stuff, and
-> using that mechanism to push changes out to the slaves.
-> Where one machine is still the master, but no trigger stuff, just a WAL patch.
-> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-> exactly, the idea hasn't completely formed yet.
-> ---------------------------(end of broadcast)---------------------------
-> TIP 5: Have you checked our extensive FAQ?
-> http://www.postgresql.org/users-lounge/docs/faq.html
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-From pgsql-hackers-owner+M18660=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 15:20:32 2002
-Return-path: <pgsql-hackers-owner+M18660=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18KKSP03731
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 15:20:29 -0500 (EST)
-Received: (qmail 28961 invoked by alias); 8 Feb 2002 20:20:27 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 20:20:27 -0000
-Received: from inflicted.crimelabs.net (crimelabs.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18KC7E27667
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 15:12:07 -0500 (EST)
- (envelope-from bpalmer@crimelabs.net)
-Received: from mizer.crimelabs.net (mizer.crimelabs.net [])
- by inflicted.crimelabs.net (Postfix) with ESMTP
- id 1066F8787; Fri, 8 Feb 2002 15:12:08 -0500 (EST)
-Date: Fri, 8 Feb 2002 15:12:00 -0500 (EST)
-From: bpalmer <bpalmer@crimelabs.net>
-To: Randall Jonasz <rjonasz@trueimpact.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <20020208142932.H6545-100000@nietzsche.trueimpact.net>
-Message-ID: <Pine.BSO.4.43.0202081510130.21860-100000@mizer.crimelabs.net>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-I've not looked at the first paper, but I will.
-> Postgres-R is another interesting approach which I think should be taken
-> seriously. Anyone interested can read a paper on this at
-> http://citeseer.nj.nec.com/330257.html
-I would point you to the info on gborg, but it seems to be down at the
-- Brandon
- c: 646-456-5455 h: 201-798-4983
- b. palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M18666=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 17:41:03 2002
-Return-path: <pgsql-hackers-owner+M18666=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18Mf2P18046
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 17:41:03 -0500 (EST)
-Received: (qmail 63057 invoked by alias); 8 Feb 2002 22:41:02 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 8 Feb 2002 22:41:02 -0000
-Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18MR9E60361
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 17:27:11 -0500 (EST)
- (envelope-from darren.johnson@cox.net)
-Received: from cox.net ([]) by lakemtao03.mgt.cox.net
- (InterMail vM. 201-253-122-122-105-20011231) with ESMTP
- id <20020208222634.GTRG6710.lakemtao03.mgt.cox.net@cox.net>;
- Fri, 8 Feb 2002 17:26:34 -0500
-Message-ID: <3C643F0F.70303@cox.net>
-Date: Fri, 08 Feb 2002 16:11:43 -0500
-From: Darren Johnson <darren.johnson@cox.net>
-User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; m18) Gecko/20010131 Netscape6/6.01
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Randall Jonasz <rjonasz@trueimpact.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-References: <20020208142932.H6545-100000@nietzsche.trueimpact.net>
-Content-Type: text/plain; charset=us-ascii; format=flowed
-Content-Transfer-Encoding: 7bit
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> I've been looking into database replication theory lately and have found
-> some interesting papers discussing various approaches. (Here's
-> one paper that struck me as being very helpful,
-> http://citeseer.nj.nec.com/460405.html )
-Here is another one from that same group, that addresses the WAN issues.
-> http://www.cnds.jhu.edu/pub/papers/cnds-2002-1.pdf
----------------------------(end of broadcast)---------------------------
-TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
-From pgsql-hackers-owner+M18674=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 19:20:30 2002
-Return-path: <pgsql-hackers-owner+M18674=candle.pha.pa.us=pgman@postgresql.org>
-Received: from server1.pgsql.org (www.postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g190KTP26980
- for <pgman@candle.pha.pa.us>; Fri, 8 Feb 2002 19:20:29 -0500 (EST)
-Received: (qmail 88124 invoked by alias); 9 Feb 2002 00:20:27 -0000
-Received: from unknown (HELO postgresql.org) (
- by www.postgresql.org with SMTP; 9 Feb 2002 00:20:27 -0000
-Received: from localhost.localdomain (bgp01077650bgs.wanarb01.mi.comcast.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g190H3E87489
- for <pgsql-hackers@postgresql.org>; Fri, 8 Feb 2002 19:17:03 -0500 (EST)
- (envelope-from camber@ais.org)
-Received: from localhost (camber@localhost)
- by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g190H0P18427;
- Fri, 8 Feb 2002 19:17:00 -0500
-X-Authentication-Warning: localhost.localdomain: camber owned process doing -bs
-Date: Fri, 8 Feb 2002 19:17:00 -0500 (EST)
-From: Brian Bruns <camber@ais.org>
-X-X-Sender: <camber@localhost.localdomain>
-To: Randall Jonasz <rjonasz@trueimpact.com>
-cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <20020208142932.H6545-100000@nietzsche.trueimpact.net>
-Message-ID: <Pine.LNX.4.33.0202081904190.18420-100000@localhost.localdomain>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-> > I have in the back of my mind, an idea of patching into the WAL stuff, and
-> > using that mechanism to push changes out to the slaves.
-> >
-> > Where one machine is still the master, but no trigger stuff, just a WAL patch.
-> > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-> > exactly, the idea hasn't completely formed yet.
-> >
-FWIW, Sybase Replication Server does just such a thing.
-They have a secondary log marker that prevents the log from being truncated
-past the oldest unreplicated transaction. A thread within the system called
-the "rep agent" (it used to be a separate process called the LTM) reads
-the log and forwards it to the rep server. Once the rep server has the
-whole transaction and has written it to a stable device (i.e. synced it to
-disk), it responds to the LTM, telling it that it's OK to move the
-log marker forward.
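The log-marker handshake described above can be sketched roughly as follows. This is a toy in-memory model in Python, not Sybase's or PostgreSQL's implementation; the class name, the method names, and the use of integer log sequence numbers are all assumptions made for illustration.

```python
class ReplicatedLog:
    """Toy transaction log with a secondary (replication) truncation marker."""

    def __init__(self):
        self.records = []         # list of (lsn, payload), oldest first
        self.next_lsn = 0
        self.replicated_upto = 0  # first LSN not yet acknowledged by the rep server

    def append(self, payload):
        lsn = self.next_lsn
        self.records.append((lsn, payload))
        self.next_lsn += 1
        return lsn

    def pending(self):
        """Records the rep agent still has to forward."""
        return [r for r in self.records if r[0] >= self.replicated_upto]

    def acknowledge(self, lsn):
        """Rep server confirms that everything up to lsn is on stable storage."""
        self.replicated_upto = max(self.replicated_upto, lsn + 1)

    def truncate(self, checkpoint_lsn):
        """Drop old records, but never past the oldest unreplicated one."""
        limit = min(checkpoint_lsn, self.replicated_upto)
        self.records = [r for r in self.records if r[0] >= limit]
```

The point of the design is in `truncate`: a checkpoint alone is not enough to discard log records; the replication marker must also have moved past them.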
-Anyway, once the replication server proper has the transaction it uses a
-publish/subscribe methodology to see who wants to get the update.
-Bidirectional replication is done by making two one-way replications. The
-whole thing is table based: it marks the tables as replicated or not in
-the database to save the trip to the rep server for unreplicated tables.
-Plus you can take parts of a database (replicate all rows where the
-country is "us" to this server and all the rows with "uk" to that server).
-Or, conversely, you can roll up smaller regional databases into bigger ones;
-it's very flexible.
----------------------------(end of broadcast)---------------------------
-TIP 4: Don't 'kill -9' the postmaster
+++ /dev/null
-From pgsql-hackers-owner+M1833@hub.org Sat May 13 22:49:26 2000
-Received: from news.tht.net (news.hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA07394
- for <pgman@candle.pha.pa.us>; Sat, 13 May 2000 22:49:24 -0400 (EDT)
-Received: from hub.org (majordom@hub.org [])
- by news.tht.net (8.9.3/8.9.3) with ESMTP id WAB99859;
- Sat, 13 May 2000 22:44:15 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M1833@hub.org)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by hub.org (8.9.3/8.9.3) with ESMTP id WAA51058
- for <pgsql-hackers@postgreSQL.org>; Sat, 13 May 2000 22:41:16 -0400 (EDT)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA18343
- for <pgsql-hackers@postgreSQL.org>; Sat, 13 May 2000 22:40:38 -0400 (EDT)
-To: pgsql-hackers@postgresql.org
-Subject: [HACKERS] Proposal for fixing numeric type-resolution issues
-Date: Sat, 13 May 2000 22:40:38 -0400
-Message-ID: <18340.958272038@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ORr
-We've got a collection of problems that are related to the parser's
-inability to make good type-resolution choices for numeric constants.
-In some cases you get a hard error; for example "NumericVar + 4.4"
-ERROR: Unable to identify an operator '+' for types 'numeric' and 'float8'
- You will have to retype this query using an explicit cast
-because "4.4" is initially typed as float8 and the system can't figure
-out whether to use numeric or float8 addition. A more subtle problem
-is that a query like "... WHERE Int2Var < 42" is unable to make use of
-an index on the int2 column: 42 is resolved as int4, so the operator
-is int24lt, which works but is not in the opclass of an int2 index.
-Here is a proposal for fixing these problems. I think we could get this
-done for 7.1 if people like it.
-The basic problem is that there's not enough smarts in the type resolver
-about the interrelationships of the numeric datatypes. All it has is
-a concept of a most-preferred type within the category of numeric types.
-(We are abusing the most-preferred-type mechanism, BTW, because both
-FLOAT8 and NUMERIC claim to be the most-preferred type in the numeric
-category! This is in fact why the resolver can't make a choice for
-"numeric+float8".) We need more intelligence than that.
-I propose that we set up a strictly-ordered hierarchy of numeric
-datatypes, running from least preferred to most preferred:
- int2, int4, int8, numeric, float4, float8.
-Rather than simply considering coercions to the most-preferred type,
-the type resolver should use the following rules:
-1. No value will be down-converted (eg int4 to int2) except by an
-explicit conversion.
-2. If there is not an exact matching operator, numeric values will be
-up-converted to the highest numeric datatype present among the operator
-or function's arguments. For example, given "int2 + int8" we'd up-
-convert the int2 to int8 and apply int8 addition.
-The final piece of the puzzle is that the type initially assigned to
-an undecorated numeric constant should be NUMERIC if it contains a
-decimal point or exponent, and otherwise the smallest of int2, int4,
-int8, NUMERIC that will represent it. This is a considerable change
-from the current lexer behavior, where you get either int4 or float8.
-For example, given "NumericVar + 4.4", the constant 4.4 will initially
-be assigned type NUMERIC, we will resolve the operator as numeric plus,
-and everything's fine. Given "Float8Var + 4.4", the constant is still
-initially numeric, but will be up-converted to float8 so that float8
-addition can be used. The end result is the same as in traditional
-Postgres: you get float8 addition. Given "Int2Var < 42", the constant
-is initially typed as int2, since it fits, and we end up selecting
-int2lt, thereby allowing use of an int2 index. (On the other hand,
-given "Int2Var < 100000", we'd end up using int4lt, which is correct
-to avoid overflow.)
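The literal-typing rule and rule 2 above can be illustrated with a small Python sketch. It is only a model of the proposal as worded in this message, not parser code; the function names and the range table are assumptions introduced here.

```python
# Hierarchy from the proposal, least preferred first.
NUMERIC_HIERARCHY = ["int2", "int4", "int8", "numeric", "float4", "float8"]
RANK = {name: i for i, name in enumerate(NUMERIC_HIERARCHY)}

INT_RANGES = [
    ("int2", -2**15, 2**15 - 1),
    ("int4", -2**31, 2**31 - 1),
    ("int8", -2**63, 2**63 - 1),
]

def initial_literal_type(text):
    """Type initially assigned to an undecorated numeric constant."""
    if any(ch in text for ch in ".eE"):
        return "numeric"          # decimal point or exponent present
    value = int(text)
    for name, lo, hi in INT_RANGES:
        if lo <= value <= hi:
            return name           # smallest integer type that fits
    return "numeric"              # too large for int8

def resolve_operand_type(left, right):
    """Rule 2: up-convert to the highest numeric type among the operands."""
    return max(left, right, key=RANK.__getitem__)
```

For instance, `resolve_operand_type("int2", initial_literal_type("100000"))` selects int4, matching the "Int2Var < 100000" case in the text.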
-A couple of crucial subtleties here:
-1. We are assuming that the parser or optimizer will constant-fold
-any conversion functions that are introduced. Thus, in the
-"Float8Var + 4.4" case, the 4.4 is represented as a float8 4.4 by the
-time execution begins, so there's no performance loss.
-2. We cannot lose precision by initially representing a constant as
-numeric and later converting it to float. Nor can we exceed NUMERIC's
-range (the default 1000-digit limit is more than the range of IEEE
-float8 data). It would not work as well to start out by representing
-a constant as float and then converting it to numeric.
-Presently, the pg_proc and pg_operator tables contain a pretty fair
-collection of cross-datatype numeric operators, such as int24lt,
-float48pl, etc. We could perhaps leave these in, but I believe that
-it is better to remove them. For example, if int42lt is left in place,
-then it would capture cases like "Int4Var < 42", whereas we need that
-to be translated to int4lt so that an int4 index can be used. Removing
-these operators will eliminate some code bloat and system-catalog bloat
-to boot.
-As far as I can tell, this proposal is almost compatible with the rules
-given in SQL92: in particular, SQL92 specifies that an operator having
-both "approximate numeric" (float) and "exact numeric" (int or numeric)
-inputs should deliver an approximate-numeric result. I propose
-deviating from SQL92 in a single respect: SQL92 specifies that a
-constant containing an exponent (eg 1.2E34) is approximate numeric,
-which implies that the result of an operator using it is approximate
-even if the other operand is exact. I believe it's better to treat
-such a constant as exact (ie, type NUMERIC) and only convert it to
-float if the other operand is float. Without doing that, an assignment
- UPDATE tab SET NumericVar = 1.234567890123456789012345E34;
-will not work as desired because the constant will be prematurely
-coerced to float, causing precision loss.
- regards, tom lane
-From tgl@sss.pgh.pa.us Sun May 14 17:30:56 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA05808
- for <pgman@candle.pha.pa.us>; Sun, 14 May 2000 17:30:52 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us []) by renoir.op.net (o1/$Revision: 1.4 $) with ESMTP id RAA16657 for <pgman@candle.pha.pa.us>; Sun, 14 May 2000 17:29:52 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id RAA20914;
- Sun, 14 May 2000 17:29:30 -0400 (EDT)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] type conversion discussion
-In-reply-to: <200005141950.PAA04636@candle.pha.pa.us>
-References: <200005141950.PAA04636@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Sun, 14 May 2000 15:50:20 -0400"
-Date: Sun, 14 May 2000 17:29:30 -0400
-Message-ID: <20911.958339770@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: OR
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> At some point, it seems we need to get all the PostgreSQL minds together
-> to discuss type conversion issues. These problems continue to come up
-> from release to release. We are getting better, but it seems a full
-> discussion could help solidify our strategy.
-OK, here are a few things that bug me about the current type-resolution
-1. Poor choice of type to attribute to numeric literals. (A possible
- solution is sketched in my earlier message, but do we need similar
- mechanisms for other type categories?)
-2. Tensions between treating string literals as "unknown" type and
- as "text" type, per this thread so far.
-3. IS_BINARY_COMPATIBLE seems like a bogus concept. Do we really want a
- fully symmetrical ring of types in each group? I'd prefer to see a
- one-way equivalence, which allows eg. OID to be silently converted
- to INT4, but *not* vice versa (except perhaps by specific user cast).
- This'd be more like a traditional "is-a" or inheritance relationship
- between datatypes, which has well-understood semantics.
-4. I'm also concerned that the behavior of IS_BINARY_COMPATIBLE isn't
- very predictable because it will happily go either way. For example,
- if I do
- select * from pg_class where oid = 1234;
- it's unclear whether I will get an oideq or an int4eq operator ---
- and that's a rather critical point since only one of them can exploit
- an index on the oid column. Currently, there is some klugery in the
- planner that works around this by overriding the parser's choice of
- operator to substitute one that is compatible with an available index.
- That's a pretty ugly solution ... I'm not sure I know a better one,
- but as long as we're discussing type resolution issues ...
-5. Lack of extensibility. There's way too much knowledge hard-wired
- into the parser about type categories, preferred types, binary
- compatibility, etc. All of it falls down when faced with
- user-defined datatypes. If we do something like I suggested with
- a hardwired hierarchy of numeric datatypes, it'll get even worse.
- All this stuff ought to be driven off fields in pg_type rather than
- be hardwired into the code, so that the same concepts can be extended
- to user-defined types.
-I don't have worked-out proposals for any of these but the first,
-but they've all been bothering me for a while.
- regards, tom lane
-From tgl@sss.pgh.pa.us Sun May 14 21:02:31 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07700
- for <pgman@candle.pha.pa.us>; Sun, 14 May 2000 21:02:28 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA21261;
- Sun, 14 May 2000 21:03:17 -0400 (EDT)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] type conversion discussion
-In-reply-to: <20911.958339770@sss.pgh.pa.us>
-References: <200005141950.PAA04636@candle.pha.pa.us> <20911.958339770@sss.pgh.pa.us>
-Comments: In-reply-to Tom Lane <tgl@sss.pgh.pa.us>
- message dated "Sun, 14 May 2000 17:29:30 -0400"
-Date: Sun, 14 May 2000 21:03:17 -0400
-Message-ID: <21258.958352597@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: OR
-Here are the results of some further thoughts about type-conversion
-issues. This is not a complete proposal yet, but a sketch of an
-approach that might solve several of the gripes in my previous proposal.
-While thinking about this, I realized that my numeric-types proposal
-of yesterday would break at least a few cases that work nicely now.
-For example, I frequently do things like
- select * from pg_class where oid = 1234;
-whilst poking around in system tables and querytree dumps. If that
-constant is initially resolved as int2, as I suggested yesterday,
-then we have "oid = int2" for which there is no operator. To succeed
-we must decide to promote the constant to int4 --- but with no int4
-visible among the operands of the "=", it will not work to just "promote
-numerics to the highest type seen in the operands" as I suggested
-yesterday. So there has to be some more interaction in there.
-Anyway, I was complaining about the looseness of the concept of
-binary-compatible types and the fact that the parser's type conversion
-knowledge is mostly hardwired. These might be resolved by generalizing
-the numeric type hierarchy idea into a "type promotion lattice", which
-would work like this:
-* Add a "typpromote" column to pg_type, which contains either zero or
- the OID of another type that the parser is allowed to promote this
- type to when searching for usable functions/operators. For example,
- my numeric-types hierarchy of yesterday would be expressed by making
- int2 promote to int4, int4 to int8, int8 to numeric, numeric to
- float4, and float4 to float8. The promotion idea also replaces the
- current concept of binary-compatible types: for example, OID would
- link to int4 and varchar would link to text (but not vice versa!).
-* Also add a "typpromotebin" boolean column to pg_type, which contains
- 't' if the type conversion indicated by typpromote is "free", ie,
- no conversion function need be executed before regarding a value as
- belonging to the promoted type. This distinguishes binary-compatible
- from non-binary-compatible cases. If "typpromotebin" is 'f' and the
- parser decides it needs to apply the conversion, then it has to look
- up the appropriate conversion function in pg_proc. (More about this
- below.)
-Now, if the parser fails to find an exact match for a given function
-or operator name and the exact set of input data types, it proceeds by
-chasing up the promotion chains for the input data types and trying to
-locate a set of types for which there is a matching function/operator.
-If there are multiple possibilities, we choose the one which is the
-"least promoted" by some yet-to-be-determined metric. (This metric
-would probably favor "free" conversions over non-free ones, but other
-than that I'm not quite sure how it should work. The metric would
-replace a whole bunch of ad-hoc heuristics that are currently applied
-in the type resolver, so even if it seems rather ad-hoc it'd still be
-cleaner than what we have ;-).)
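A rough Python sketch of the promotion-chain search described above. The message leaves the "least promoted" metric undetermined; the sketch assumes, purely for illustration, that the metric is the total number of promotion steps, and it ignores the free versus non-free distinction. The table contents and function names are hypothetical.

```python
from itertools import product

# Hypothetical typpromote links: type -> type it may be promoted to.
TYPPROMOTE = {
    "int2": "int4", "int4": "int8", "int8": "numeric",
    "numeric": "float4", "float4": "float8", "float8": None,
    "oid": "int4",
    "varchar": "text", "bpchar": "text", "text": None,
}

def promotion_chain(typ):
    """Return [(type, steps)] reachable from typ by following typpromote."""
    chain, steps = [], 0
    while typ is not None:
        chain.append((typ, steps))
        typ = TYPPROMOTE.get(typ)
        steps += 1
    return chain

def resolve_operator(name, arg_types, catalog):
    """Pick the signature in catalog needing the fewest promotion steps.

    catalog is a set of (operator_name, signature_tuple) entries.
    Returns the chosen signature, or None if nothing matches.
    """
    best = None
    for combo in product(*(promotion_chain(t) for t in arg_types)):
        signature = tuple(t for t, _ in combo)
        cost = sum(s for _, s in combo)   # assumed metric: total steps
        if (name, signature) in catalog and (best is None or cost < best[0]):
            best = (cost, signature)
    return None if best is None else best[1]
```

With a catalog holding "=" for (oid, oid), (int4, int4) and (int8, int8), resolving "oid = int2" under this assumed metric settles on (int4, int4), since there is no promotion path from int2 to oid.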
-In a situation like the "oid = int2" example above, this mechanism would
-presumably settle on "int4 = int4" as being the least-promoted
-equivalent operator. (It could not find "oid = oid" since there is
-no promotion path from int2 to oid.) That looks bad since it isn't
-compatible with an oidops index --- but I have a solution for that!
-I don't think we need the oid opclass at all; why shouldn't indexes
-on oid be expressed as int4 indexes to begin with? In general, if
-two types are considered binary-equivalent under the old scheme, then
-the one that is considered the subtype probably shouldn't have separate
-index operators under this new scheme. Instead it should just rely on
-the index operators of the promoted type.
-The point of the proposed typpromotebin field is to save a pg_proc
-lookup when trying to determine whether a particular promotion is "free"
-or not. We could save even more lookups if we didn't store the boolean
-but instead the actual OID of the conversion function, or zero if the
-promotion is "free". The trouble with that is that it creates a
-circularity problem when trying to define a new user type --- you can't
-define the conversion function if its input type doesn't exist yet.
-In any case, we want the parser to do a function lookup if we've
-advanced more than one step in the promotion hierarchy: if we've decided
-to promote int4 to float8 (which will be a four-step chain through int8,
-numeric, float4) we sure want the thing to use a direct int4tofloat8
-conversion function if available, not a chain of four conversion
-functions. So on balance I think we want to look in pg_proc once we've
-decided which conversion to perform. The only reason for having
-typpromotebin is that the promotion metric will want to know which
-conversions are free, and we don't want to have to do a lookup in
-pg_proc for each alternative we consider, only the ones that are finally
-selected to be used.
-I can think of at least one special case that still isn't cleanly
-handled under this scheme, and that is bpchar vs. varchar comparison.
-Currently, we have
-regression=# select 'a'::bpchar = 'a '::bpchar;
- ?column?
- t
-(1 row)
-This is correct since trailing blanks are insignificant in bpchar land,
-so the two values should be considered equal. If we try
-regression=# select 'a'::bpchar = 'a '::varchar;
-ERROR: Unable to identify an operator '=' for types 'bpchar' and 'varchar'
- You will have to retype this query using an explicit cast
-which is pretty bogus but at least it saves the system from making some
-random choice about whether bpchar or varchar comparison rules apply.
-On the other hand,
-regression=# select 'a'::bpchar = 'a '::text;
- ?column?
- f
-(1 row)
-Here the bpchar value has been promoted to text and then text comparison
-(where trailing blanks *are* significant) is applied. I'm not sure that
-we can really justify doing this in this case when we reject the bpchar
-vs varchar case, but maybe someone wants to argue that that's correct.
-The natural setup in my type-promotion scheme would be that both bpchar
-and varchar link to 'text' as their promoted type. If we do nothing
-special then text-style comparison would be used in a bpchar vs varchar
-comparison, which is arguably wrong.
-One way to deal with this without introducing kluges into the type
-resolver is to provide a full set of bpchar vs text and text vs bpchar
-operators, and make sure that the promotion metric is such that these
-will be used in place of text vs text operators if they apply (which
-should hold, I think, for any reasonable metric). This is probably
-the only way to get the "right" behavior in any case --- I think that
-the "right" behavior for such comparisons is to strip trailing blanks
-from the bpchar side but not the text/varchar side. (I haven't checked
-to see if SQL92 agrees, though.)
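The comparison semantics discussed above can be modelled in a few lines of Python. This is only an illustration of the behaviour the message calls "right" (strip trailing blanks on the bpchar side only); the function names are made up here and do not correspond to entries in pg_proc.

```python
def text_eq(a, b):
    """text vs text: trailing blanks are significant."""
    return a == b

def bpchar_eq(a, b):
    """bpchar vs bpchar: trailing blanks are insignificant on both sides."""
    return a.rstrip(" ") == b.rstrip(" ")

def bpchar_text_eq(bp, txt):
    """bpchar vs text: strip trailing blanks from the bpchar side only."""
    return bp.rstrip(" ") == txt
```

Under this model `bpchar_text_eq("a ", "a")` is true while `bpchar_text_eq("a", "a ")` is false, which is the asymmetry a dedicated bpchar vs text operator would encode.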
-Another issue is how to fit resolution of "unknown" literals into this
-scheme. We could probably continue to handle them more or less as we
-do now, but they might complicate the promotion metric.
-I am not clear yet on whether we'd still need the concept of "type
-categories" as they presently exist in the resolver. It's possible
-that we wouldn't, which would be a nice simplification. (If we do
-still need them, we should have a column in pg_type that defines the
-category of a type, instead of hard-wiring category assignments.)
- regards, tom lane
-From e99re41@DoCS.UU.SE Mon May 15 07:39:03 2000
-Received: from meryl.it.uu.se (root@meryl.it.uu.se [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id HAA10251
- for <pgman@candle.pha.pa.us>; Mon, 15 May 2000 07:39:01 -0400 (EDT)
-Received: from Zebra.DoCS.UU.SE (e99re41@Zebra.DoCS.UU.SE [])
- by meryl.it.uu.se (8.8.5/8.8.5) with ESMTP id NAA10849;
- Mon, 15 May 2000 13:39:45 +0200 (MET DST)
-Received: from localhost (e99re41@localhost) by Zebra.DoCS.UU.SE (8.6.12/8.6.12) with ESMTP id NAA26523; Mon, 15 May 2000 13:39:44 +0200
-X-Authentication-Warning: Zebra.DoCS.UU.SE: e99re41 owned process doing -bs
-Date: Mon, 15 May 2000 13:39:44 +0200 (MET DST)
-From: Peter Eisentraut <e99re41@DoCS.UU.SE>
-Reply-To: Peter Eisentraut <peter_e@gmx.net>
-To: Tom Lane <tgl@sss.pgh.pa.us>
-cc: Bruce Momjian <pgman@candle.pha.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] type conversion discussion
-In-Reply-To: <20911.958339770@sss.pgh.pa.us>
-Message-ID: <Pine.GSO.4.02A.10005151309020.26399-100000@Zebra.DoCS.UU.SE>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=iso-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by candle.pha.pa.us id HAA10251
-Status: OR
-On Sun, 14 May 2000, Tom Lane wrote:
-> 1. Poor choice of type to attribute to numeric literals. (A possible
-> solution is sketched in my earlier message, but do we need similar
-> mechanisms for other type categories?)
-I think your plan looks good for the numerical land. (I'll ponder the oid
-issues in a second.) For other type categories, perhaps not. Should a line
-be promoted to a polygon so you can check if it contains a point? Or a
-polygon to a box? Higher dimensions? :-)
-> 2. Tensions between treating string literals as "unknown" type and
-> as "text" type, per this thread so far.
-Yes, while we're at it, let's look at this in detail. I claim that
-something of the form 'xxx' should always be text (or char or whatever),
-period. Let's consider the cases where this could potentially clash with
-the current behaviour:
-a) The target type is unambiguously clear, e.g., UPDATE ... SET. Then you
-cast text to the target type. The effect is identical.
-b) The target type is completely unspecified, e.g. CREATE TABLE AS SELECT
-'xxx'; This will currently create an "unknown" column. It should arguably
-create a "text" column.
-Function argument resolution:
-c) There is only one function and it has a "text" argument. No-brainer.
-d) There is only one function and it has an argument other than text. Try
-to cast text to that type. (This is what's done in general, isn't it?)
-e) The function is overloaded for many types, amongst which is text. Then
-call the text version. I believe this would currently fail, which I'd
-consider a deficiency.
-f) The function is overloaded for many types, none of which is text. In
-that case you have to cast anyway, so you don't lose anything.
-One thing to also keep in mind regarding required casting for (b) and (f)
-is that SQL never allowed literals of "fancy" types (e.g., DATE) to have
-undecorated 'yyyy-mm-dd' constants; you always have to say DATE
-'yyyy-mm-dd'. What Postgres allows is a convenience where DATE would be
-obvious or implied. In the end it's a win-win situation: you tell the
-system what you want, and your code is clearer.
-> 3. IS_BINARY_COMPATIBLE seems like a bogus concept.
-At least it's bogus when used for types which are not actually binary
-compatible, e.g. int4 and oid. The result of the current implementation is
-that you can perfectly happily insert and retrieve negative numbers from
-oid fields.
-I'm not so sure about the value of this particular equivalency anyway.
-AFAICS the only functions that make sense for oids are comparisons (incl.
-min, max), adding integers to them, subtracting one oid from another.
-Silent mangling with int4 means that you can multiply them, square them,
-add floating point numbers to them (doesn't really work in practice
-though), all things that have no business with oids.
-I'd say define the operators that are useful for oids explicitly for oids
-and require casts for all others, so the users know what they're doing.
-The fact that an oid is also a number should be an implementation detail.
-In my mind oids are like pointers in C. Indiscriminate mangling of
-pointers and integers in C has long been dismissed as questionable coding.
-Of course I'd be very willing to consider counterexamples to these
-theories ...
-Peter Eisentraut Sernanders väg 10:115
-peter_e@gmx.net 75262 Uppsala
-http://yi.org/peter-e/ Sweden
-From tgl@sss.pgh.pa.us Tue Jun 13 04:58:20 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA24281
- for <pgman@candle.pha.pa.us>; Tue, 13 Jun 2000 03:58:18 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id DAA02571;
- Tue, 13 Jun 2000 03:58:43 -0400 (EDT)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Proposal for fixing numeric type-resolution issues
-In-reply-to: <200006130741.DAA23502@candle.pha.pa.us>
-References: <200006130741.DAA23502@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Tue, 13 Jun 2000 03:41:56 -0400"
-Date: Tue, 13 Jun 2000 03:58:43 -0400
-Message-ID: <2568.960883123@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: OR
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> Again, anything to add to the TODO here?
-IIRC, there was some unhappiness with the proposal you quote, so I'm
-not sure we've quite agreed what to do... but clearly something must
-be done.
- regards, tom lane
->> We've got a collection of problems that are related to the parser's
->> inability to make good type-resolution choices for numeric constants.
->> In some cases you get a hard error; for example "NumericVar + 4.4"
->> yields
->> ERROR: Unable to identify an operator '+' for types 'numeric' and 'float8'
->> You will have to retype this query using an explicit cast
->> because "4.4" is initially typed as float8 and the system can't figure
->> out whether to use numeric or float8 addition. A more subtle problem
->> is that a query like "... WHERE Int2Var < 42" is unable to make use of
->> an index on the int2 column: 42 is resolved as int4, so the operator
->> is int24lt, which works but is not in the opclass of an int2 index.
->> Here is a proposal for fixing these problems. I think we could get this
->> done for 7.1 if people like it.
->> The basic problem is that there's not enough smarts in the type resolver
->> about the interrelationships of the numeric datatypes. All it has is
->> a concept of a most-preferred type within the category of numeric types.
->> (We are abusing the most-preferred-type mechanism, BTW, because both
->> FLOAT8 and NUMERIC claim to be the most-preferred type in the numeric
->> category! This is in fact why the resolver can't make a choice for
->> "numeric+float8".) We need more intelligence than that.
->> I propose that we set up a strictly-ordered hierarchy of numeric
->> datatypes, running from least preferred to most preferred:
->> int2, int4, int8, numeric, float4, float8.
->> Rather than simply considering coercions to the most-preferred type,
->> the type resolver should use the following rules:
->> 1. No value will be down-converted (eg int4 to int2) except by an
->> explicit conversion.
->> 2. If there is not an exact matching operator, numeric values will be
->> up-converted to the highest numeric datatype present among the operator
->> or function's arguments. For example, given "int2 + int8" we'd up-
->> convert the int2 to int8 and apply int8 addition.
->> The final piece of the puzzle is that the type initially assigned to
->> an undecorated numeric constant should be NUMERIC if it contains a
->> decimal point or exponent, and otherwise the smallest of int2, int4,
->> int8, NUMERIC that will represent it. This is a considerable change
->> from the current lexer behavior, where you get either int4 or float8.
->> For example, given "NumericVar + 4.4", the constant 4.4 will initially
->> be assigned type NUMERIC, we will resolve the operator as numeric plus,
->> and everything's fine. Given "Float8Var + 4.4", the constant is still
->> initially numeric, but will be up-converted to float8 so that float8
->> addition can be used. The end result is the same as in traditional
->> Postgres: you get float8 addition. Given "Int2Var < 42", the constant
->> is initially typed as int2, since it fits, and we end up selecting
->> int2lt, thereby allowing use of an int2 index. (On the other hand,
->> given "Int2Var < 100000", we'd end up using int4lt, which is correct
->> to avoid overflow.)
->> A couple of crucial subtleties here:
->> 1. We are assuming that the parser or optimizer will constant-fold
->> any conversion functions that are introduced. Thus, in the
->> "Float8Var + 4.4" case, the 4.4 is represented as a float8 4.4 by the
->> time execution begins, so there's no performance loss.
->> 2. We cannot lose precision by initially representing a constant as
->> numeric and later converting it to float. Nor can we exceed NUMERIC's
->> range (the default 1000-digit limit is more than the range of IEEE
->> float8 data). It would not work as well to start out by representing
->> a constant as float and then converting it to numeric.
->> Presently, the pg_proc and pg_operator tables contain a pretty fair
->> collection of cross-datatype numeric operators, such as int24lt,
->> float48pl, etc. We could perhaps leave these in, but I believe that
->> it is better to remove them. For example, if int42lt is left in place,
->> then it would capture cases like "Int4Var < 42", whereas we need that
->> to be translated to int4lt so that an int4 index can be used. Removing
->> these operators will eliminate some code bloat and system-catalog bloat
->> to boot.
->> As far as I can tell, this proposal is almost compatible with the rules
->> given in SQL92: in particular, SQL92 specifies that an operator having
->> both "approximate numeric" (float) and "exact numeric" (int or numeric)
->> inputs should deliver an approximate-numeric result. I propose
->> deviating from SQL92 in a single respect: SQL92 specifies that a
->> constant containing an exponent (eg 1.2E34) is approximate numeric,
->> which implies that the result of an operator using it is approximate
->> even if the other operand is exact. I believe it's better to treat
->> such a constant as exact (ie, type NUMERIC) and only convert it to
->> float if the other operand is float. Without doing that, an assignment
->> like
->> UPDATE tab SET NumericVar = 1.234567890123456789012345E34;
->> will not work as desired because the constant will be prematurely
->> coerced to float, causing precision loss.
->> Comments?
->> regards, tom lane
-> --
-> Bruce Momjian | http://www.op.net/~candle
-> pgman@candle.pha.pa.us | (610) 853-3000
-> + If your life is a hard drive, | 830 Blythe Avenue
-> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
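The numeric-constant proposal quoted above can be sketched roughly as follows. This is only an illustration of the two stated rules (initial typing of undecorated constants, and up-conversion to the highest type present), not the parser's actual code. The function names and the integer range table are assumptions made for the example.

```python
# Proposed hierarchy from the quoted message, least preferred first.
HIERARCHY = ["int2", "int4", "int8", "numeric", "float4", "float8"]
RANK = {name: position for position, name in enumerate(HIERARCHY)}

# Value ranges of the integer types (assumed two's-complement widths).
INT_RANGES = [
    ("int2", -2**15, 2**15 - 1),
    ("int4", -2**31, 2**31 - 1),
    ("int8", -2**63, 2**63 - 1),
]


def initial_constant_type(literal: str) -> str:
    """Type first assigned to an undecorated numeric constant.

    A decimal point or exponent means NUMERIC; otherwise the smallest
    of int2, int4, int8, NUMERIC that can represent the value.
    """
    if any(ch in literal for ch in ".eE"):
        return "numeric"
    value = int(literal)
    for name, low, high in INT_RANGES:
        if low <= value <= high:
            return name
    return "numeric"


def resolve_binary_operator(left: str, right: str) -> str:
    """Common operand type: up-convert to the higher-ranked input.

    Nothing is ever down-converted, so the result is one of the inputs.
    """
    return left if RANK[left] >= RANK[right] else right


# The cases discussed in the proposal:
print(resolve_binary_operator("numeric", initial_constant_type("4.4")))
print(resolve_binary_operator("float8", initial_constant_type("4.4")))
print(resolve_binary_operator("int2", initial_constant_type("42")))
print(resolve_binary_operator("int2", initial_constant_type("100000")))
```

Under these assumptions the four calls select numeric, float8, int2, and int4 operators respectively, which matches the outcomes described in the proposal.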
-From tgl@sss.pgh.pa.us Mon Jun 12 14:09:45 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA01993
- for <pgman@candle.pha.pa.us>; Mon, 12 Jun 2000 13:09:43 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id NAA01515;
- Mon, 12 Jun 2000 13:10:01 -0400 (EDT)
-To: Peter Eisentraut <peter_e@gmx.net>
-cc: Bruce Momjian <pgman@candle.pha.pa.us>,
- "Thomas G. Lockhart" <lockhart@alumni.caltech.edu>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Adding time to DATE type
-In-reply-to: <Pine.LNX.4.21.0006110322150.9195-100000@localhost.localdomain>
-References: <Pine.LNX.4.21.0006110322150.9195-100000@localhost.localdomain>
-Comments: In-reply-to Peter Eisentraut <peter_e@gmx.net>
- message dated "Sun, 11 Jun 2000 13:41:24 +0200"
-Date: Mon, 12 Jun 2000 13:10:00 -0400
-Message-ID: <1512.960829800@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: ORr
-Peter Eisentraut <peter_e@gmx.net> writes:
-> Bruce Momjian writes:
->> Can someone give me a TODO summary for this issue?
-> * make 'text' constants default to text type (not unknown)
-> (I think not everyone's completely convinced on this issue, but I don't
-> recall anyone being firmly opposed to it.)
-It would be a mistake to eliminate the distinction between unknown and
-text. See for example my just-posted response to John Cochran on
-pgsql-general about why 'BOULEVARD'::text behaves differently from
-'BOULEVARD'::char. If string literals are immediately assigned type
-text then we will have serious problems with char(n) fields.
-I think it's fine to assign string literals a type of 'unknown'
-initially. What we need to do is add a phase of type resolution that
-considers treating them as text, but only after the existing logic fails
-to deduce a type.
-(BTW it might be better to treat string literals as defaulting to char(n)
-instead of text, allowing the normal promotion rules to replace char(n)
-with text if necessary. Not sure if that would make things more or less
-confusing for operations that intermix fixed- and variable-width char types.)
- regards, tom lane
-From pgsql-hackers-owner+M1936@postgresql.org Sun Dec 10 13:17:54 2000
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA20676
- for <pgman@candle.pha.pa.us>; Sun, 10 Dec 2000 13:17:54 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [])
- by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eBAIGvZ40566;
- Sun, 10 Dec 2000 13:16:57 -0500 (EST)
- (envelope-from pgsql-hackers-owner+M1936@postgresql.org)
-Received: from sss.pgh.pa.us (sss.pgh.pa.us [])
- by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eBAI8HZ39820
- for <pgsql-hackers@postgreSQL.org>; Sun, 10 Dec 2000 13:08:17 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.1/8.11.1) with ESMTP id eBAI82o28682;
- Sun, 10 Dec 2000 13:08:02 -0500 (EST)
-To: Thomas Lockhart <lockhart@alumni.caltech.edu>
-cc: pgsql-hackers@postgresql.org
-Subject: [HACKERS] Unknown-type resolution rules, redux
-Date: Sun, 10 Dec 2000 13:08:02 -0500
-Message-ID: <28679.976471682@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-parse_coerce.c contains the following conversation --- I believe the
-first XXX comment is from me and the second from you:
- /*
- * Still too many candidates? Try assigning types for the unknown
- * columns.
- *
- * We do this by examining each unknown argument position to see if all
- * the candidates agree on the type category of that slot. If so, and
- * if some candidates accept the preferred type in that category,
- * eliminate the candidates with other input types. If we are down to
- * one candidate at the end, we win.
- *
- * XXX It's kinda bogus to do this left-to-right, isn't it? If we
- * eliminate some candidates because they are non-preferred at the
- * first slot, we won't notice that they didn't have the same type
- * category for a later slot.
- * XXX Hmm. How else would you do this? These candidates are here because
- * they all have the same number of matches on arguments with explicit
- * types, so from here on left-to-right resolution is as good as any.
- * Need a counterexample to see otherwise...
- */
-The comment is out of date anyway because it fails to mention the new
-rule about preferring STRING category. But to answer your request for
-a counterexample: consider
- SELECT foo('bar', 'baz')
-First, suppose the available candidates are
- foo(float8, int4)
- foo(float8, point)
-In this case, we examine the first argument position, see that all the
-candidates agree on NUMERIC category, so we consider resolving the first
-unknown input to float8. That eliminates neither candidate so we move
-on to the second argument position. Here there is a conflict of
-categories so we can't eliminate anything, and we decide the call is
-ambiguous. That's correct (or at least Operating As Designed ;-)).
-But now suppose we have
- foo(float8, int4)
- foo(float4, point)
-Here, at the first position we will still see that all candidates agree
-on NUMERIC category, and then we will eliminate candidate 2 because it
-isn't the preferred type in that category. Now when we come to the
-second argument position, there's only one candidate left so there's
-no category conflict. Result: this call is considered non-ambiguous.
-This means there is a left-to-right bias in the algorithm. For example,
-the exact same call *would* be considered ambiguous if the candidates'
-argument orders were reversed:
- foo(int4, float8)
- foo(point, float4)
-I do not like that. You could maybe argue that earlier arguments are
-more important than later ones for functions, but it's harder to make
-that case for binary operators --- and in any case this behavior is
-extremely difficult to explain in prose.
-To fix this, I think we need to split the loop into two passes.
-The first pass does *not* remove any candidates. What it does is to
-look separately at each UNKNOWN-argument position and attempt to deduce
-a probable category for it, using the following rules:
-* If any candidate has an input type of STRING category, use STRING
-category; else if all candidates agree on the category, use that
-category; else fail because no resolution can be made.
-* The first pass must also remember whether any candidates are of a
-preferred type within the selected category.
-The probable categories and exists-preferred-type booleans are saved in
-local arrays. (Note this has to be done this way because
-IsPreferredType currently allows more than one type to be considered
-preferred in a category ... so the first pass cannot try to determine a
-unique type, only a category.)
-If we find a category for every UNKNOWN arg, then we enter a second loop
-in which we discard candidates. In this pass we discard a candidate if
-(a) it is of the wrong category, or (b) it is of the right category but
-is not of preferred type in that category, *and* we found candidate(s)
-of preferred type at this slot.
-If we end with exactly one candidate then we win.
-It is clear in this algorithm that there is no order dependency: the
-conditions for keeping or discarding a candidate are fixed before we
-start the second pass, and do not vary depending on which other
-candidates were discarded before it.
- regards, tom lane
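The two-pass scheme described in the message above can be sketched as below. This is a minimal illustration and not the code in parse_coerce.c: the CATEGORY and PREFERRED tables are hypothetical stand-ins for the real catalog lookups, and candidates are represented simply as tuples of argument type names.

```python
# Hypothetical lookup tables, for illustration only.
CATEGORY = {
    "float8": "NUMERIC",
    "float4": "NUMERIC",
    "int4": "NUMERIC",
    "point": "GEOMETRIC",
    "text": "STRING",
}
PREFERRED = {"float8", "text"}


def resolve_unknowns(candidates, unknown_slots):
    """Return the single surviving candidate, or None if ambiguous."""
    # Pass 1: decide a category per unknown slot; remove nothing yet.
    slot_category = {}
    slot_has_preferred = {}
    for slot in unknown_slots:
        types = [cand[slot] for cand in candidates]
        categories = {CATEGORY[t] for t in types}
        if "STRING" in categories:
            chosen = "STRING"
        elif len(categories) == 1:
            chosen = next(iter(categories))
        else:
            return None  # categories conflict: no resolution possible
        slot_category[slot] = chosen
        slot_has_preferred[slot] = any(
            CATEGORY[t] == chosen and t in PREFERRED for t in types
        )

    # Pass 2: the keep/discard conditions are now fixed, so the result
    # cannot depend on the order in which candidates are examined.
    survivors = []
    for cand in candidates:
        keep = True
        for slot in unknown_slots:
            arg_type = cand[slot]
            if CATEGORY[arg_type] != slot_category[slot]:
                keep = False  # (a) wrong category
            elif slot_has_preferred[slot] and arg_type not in PREFERRED:
                keep = False  # (b) non-preferred while a preferred one exists
        if keep:
            survivors.append(cand)
    return survivors[0] if len(survivors) == 1 else None


forward = [("float8", "int4"), ("float4", "point")]
reverse = [("int4", "float8"), ("point", "float4")]
print(resolve_unknowns(forward, [0, 1]), resolve_unknowns(reverse, [0, 1]))
```

With these assumed tables, both argument orders of the foo() counterexample come out ambiguous, so the left-to-right bias disappears.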
-From pgsql-general-owner+M18949=candle.pha.pa.us=pgman@postgresql.org Sat Dec 29 15:47:47 2001
-Return-path: <pgsql-general-owner+M18949=candle.pha.pa.us=pgman@postgresql.org>
-Received: from rs.postgresql.org (server1.pgsql.org [] (may be forged))
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id fBTKlkT05111
- for <pgman@candle.pha.pa.us>; Sat, 29 Dec 2001 15:47:46 -0500 (EST)
-Received: from postgresql.org (postgresql.org [])
- by rs.postgresql.org (8.11.6/8.11.6) with ESMTP id fBTKhZN74322
- for <pgman@candle.pha.pa.us>; Sat, 29 Dec 2001 14:43:35 -0600 (CST)
- (envelope-from pgsql-general-owner+M18949=candle.pha.pa.us=pgman@postgresql.org)
-Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id fBTKaem38452
- for <pgsql-general@postgresql.org>; Sat, 29 Dec 2001 15:36:40 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.11.6/8.10.1) id fBTKaTg04256;
- Sat, 29 Dec 2001 15:36:29 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-ID: <200112292036.fBTKaTg04256@candle.pha.pa.us>
-Subject: Re: [GENERAL] Casting Varchar to Numeric
-In-Reply-To: <20011206150158.O28880-100000@megazone23.bigpanda.com>
-To: Stephan Szabo <sszabo@megazone23.bigpanda.com>
-Date: Sat, 29 Dec 2001 15:36:29 -0500 (EST)
-cc: Andy Marden <amarden@usa.net>, pgsql-general@postgresql.org
-X-Mailer: ELM [version 2.4ME+ PL96 (25)]
-MIME-Version: 1.0
-Content-Transfer-Encoding: 7bit
-Content-Type: text/plain; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-general-owner@postgresql.org
-Status: OR
-> On Mon, 3 Dec 2001, Andy Marden wrote:
-> > Martijn,
-> >
-> > It does work (believe it or not). I've now tried the method you mention
-> > below - that also works and is much nicer. I can't believe that PostgreSQL
-> > can't work this out. Surely implementing an algorithm that understands that
-> > if you can go from a ->b and b->c then you can certainly go from a->c. If
-> It's more complicated than that (and postgres does some of this but not
-> all), for example the cast text->float8->numeric potentially loses
-> precision and should probably not be an automatic cast for that reason.
-> > this is viewed as too complex a task for the internals - at least a diagram
-> > or some way of understanding how you should go from a->c would be immensely
-> > helpful wouldn't it! Daunting for anyone picking up the database and trying
-> > to do something simple(!)
-> There may be a need for documentation on this. Would you like to write
-> some ;)
-OK, I ran some tests:
- test=> create table test (x text);
- test=> insert into test values ('323');
- INSERT 5122745 1
- test=> select cast (x as numeric) from test;
- ERROR: Cannot cast type 'text' to 'numeric'
-I can see problems with automatically casting numeric to text because
-you have to guess the desired format, but going from text to numeric
-seems quite easy to do. Is there a reason we don't do it?
-I can cast to integer and float8 fine:
- test=> select cast ( x as integer) from test;
- ?column?
- ----------
- 323
- (1 row)
- test=> select cast ( x as float8) from test;
- ?column?
- ----------
- 323
- (1 row)
- Bruce Momjian | http://candle.pha.pa.us
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-general-owner+M18951=candle.pha.pa.us=pgman@postgresql.org Sat Dec 29 19:10:38 2001
-Return-path: <pgsql-general-owner+M18951=candle.pha.pa.us=pgman@postgresql.org>
-Received: from west.navpoint.com (west.navpoint.com [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id fBU0AbT23972
- for <pgman@candle.pha.pa.us>; Sat, 29 Dec 2001 19:10:37 -0500 (EST)
-Received: from rs.postgresql.org (server1.pgsql.org [] (may be forged))
- by west.navpoint.com (8.11.6/8.10.1) with ESMTP id fBTNVj008959
- for <pgman@candle.pha.pa.us>; Sat, 29 Dec 2001 18:31:45 -0500 (EST)
-Received: from postgresql.org (postgresql.org [])
- by rs.postgresql.org (8.11.6/8.11.6) with ESMTP id fBTNQrN78655
- for <pgman@candle.pha.pa.us>; Sat, 29 Dec 2001 17:26:53 -0600 (CST)
- (envelope-from pgsql-general-owner+M18951=candle.pha.pa.us=pgman@postgresql.org)
-Received: from sss.pgh.pa.us ([])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id fBTN8Fm47978
- for <pgsql-general@postgresql.org>; Sat, 29 Dec 2001 18:08:15 -0500 (EST)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id fBTN7vg20245;
- Sat, 29 Dec 2001 18:07:57 -0500 (EST)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Stephan Szabo <sszabo@megazone23.bigpanda.com>,
- Andy Marden <amarden@usa.net>, pgsql-general@postgresql.org
-Subject: Re: [GENERAL] Casting Varchar to Numeric
-In-Reply-To: <200112292036.fBTKaTg04256@candle.pha.pa.us>
-References: <200112292036.fBTKaTg04256@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Sat, 29 Dec 2001 15:36:29 -0500"
-Date: Sat, 29 Dec 2001 18:07:57 -0500
-Message-ID: <20242.1009667277@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Precedence: bulk
-Sender: pgsql-general-owner@postgresql.org
-Status: OR
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> I can see problems with automatically casting numeric to text because
-> you have to guess the desired format, but going from text to numeric
-> seems quite easy to do. Is there a reason we don't do it?
-I do not think it's a good idea to have implicit casts between text and
-everything under the sun, because that essentially destroys the type
-checking system. What we need (see previous discussion) is a flag in
-pg_proc that says whether a type conversion function may be invoked
-implicitly or not. I've got no problem with offering text(numeric) and
-numeric(text) functions that are invoked by explicit function calls or
-casts --- I just don't want the system trying to use them to make
-sense of a bogus query.
-> I can cast to integer and float8 fine:
-I don't believe that those should be available as implicit casts either.
-They are, at the moment:
-regression=# select 33 || 44.0;
- ?column?
- ----------
- 3344
-(1 row)
- regards, tom lane
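The idea of flagging which conversion functions may be invoked implicitly could look roughly like this. It is a sketch only: the Conversion record, its implicit_ok field, and the sample entries are hypothetical and do not describe the real pg_proc layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversion:
    source: str
    target: str
    implicit_ok: bool  # hypothetical stand-in for the proposed flag


# Sample entries; which conversions are marked implicit is an assumption.
CONVERSIONS = [
    Conversion("int4", "float8", implicit_ok=True),
    Conversion("text", "numeric", implicit_ok=False),
    Conversion("numeric", "text", implicit_ok=False),
]


def can_convert(source: str, target: str, explicit: bool) -> bool:
    """True if a conversion exists and may be used in this context.

    Explicit casts and function calls may use any registered conversion;
    the type resolver (explicit=False) may use only the flagged ones.
    """
    if source == target:
        return True
    for conv in CONVERSIONS:
        if conv.source == source and conv.target == target:
            return explicit or conv.implicit_ok
    return False


print(can_convert("text", "numeric", explicit=True))
print(can_convert("text", "numeric", explicit=False))
```

The point of the design is that numeric(text) stays available to anyone who asks for it, while the resolver is never allowed to reach for it when trying to make sense of a query.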
+++ /dev/null
-From Inoue@tpf.co.jp Tue Jan 18 19:08:30 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA10148
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 20:08:27 -0500 (EST)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id KAA02790; Wed, 19 Jan 2000 10:08:02 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>
-Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
-Subject: RE: [HACKERS] Index recreation in vacuum
-Date: Wed, 19 Jan 2000 10:13:40 +0900
-Message-ID: <000201bf621a$6b9baf20$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-Importance: Normal
-In-Reply-To: <200001181821.NAA02988@candle.pha.pa.us>
-Status: ROr
-> -----Original Message-----
-> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
-> [Charset iso-8859-1 unsupported, filtering to ASCII...]
-> > Hi all,
-> >
-> > I'm trying to implement REINDEX command.
-> >
-> > REINDEX operation itself is available everywhere and
-> > I've thought about applying it to VACUUM.
-> That is a good idea. Vacuuming of indexes can be very slow.
-> > .
-> > My plan is as follows.
-> >
-> > Add a new option to force index recreation in vacuum
-> > and if index recreation is specified.
-> Couldn't we auto-recreate indexes based on the number of tuples moved by
-> vacuum,
-Yes, we could probably do it. But I'm not sure about the availability of new
-vacuum.
-New vacuum would give us two big advantages:
-1) It is much faster than the current one if vacuum removes/moves many tuples.
-2) It does shrink index files.
-But in case of abort/crash:
-1) we couldn't choose an index scan for the table
-2) unique constraints of the table would be lost
-I don't know how people would weigh this disadvantage.
-> > Now I'm inclined to use relhasindex of pg_class to
-> > validate/invalidate indexes of a table at once.
-> There are a few calls to CatalogIndexInsert() that know the
-> system table they
-> are using and know it has indexes, so it does not check that field. You
-> could add cases for that.
-I think there aren't so many places to check.
-I would examine it if my idea is OK.
-Hiroshi Inoue
-From owner-pgsql-hackers@hub.org Tue Jan 18 19:57:21 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11764
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 20:57:19 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id UAA50653;
- Tue, 18 Jan 2000 20:52:38 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 20:52:30 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id UAA50513
- for pgsql-hackers-outgoing; Tue, 18 Jan 2000 20:51:32 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id UAA50462
- for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 20:51:06 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id UAA11421;
- Tue, 18 Jan 2000 20:50:50 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001190150.UAA11421@candle.pha.pa.us>
-Subject: Re: [HACKERS] Index recreation in vacuum
-In-Reply-To: <000201bf621a$6b9baf20$2801007e@tpf.co.jp> from Hiroshi Inoue at
- "Jan 19, 2000 10:13:40 am"
-To: Hiroshi Inoue <Inoue@tpf.co.jp>
-Date: Tue, 18 Jan 2000 20:50:50 -0500 (EST)
-CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: ROr
-> > > Add a new option to force index recreation in vacuum
-> > > and if index recreation is specified.
-> >
-> > Couldn't we auto-recreate indexes based on the number of tuples moved by
-> > vacuum,
-> Yes, we could probably do it. But I'm not sure the availability of new
-> vacuum.
-> New vacuum would give us a big advantage that
-> 1) Much faster than current if vacuum remove/moves many tuples.
-> 2) Does shrink index files
-> But in case of abort/crash
-> 1) couldn't choose index scan for the table
-> 2) unique constraints of the table would be lost
-> I don't know how people estimate this disadvantage.
-That's why I was recommending rename(). The actual window of
-vulnerability goes from perhaps hours to fractions of a second.
-In fact, if I understand this right, you could make the vulnerability
-zero by just performing the rename as one operation.
-In fact, for REINDEX cases where you don't have a lock on the entire
-table as you do in vacuum, you could reindex the table with a simple
-read-lock on the base table and index, and move the new index into place
-with the users seeing no change. Only people traversing the index
-during the change would have a problem. You just need an exclusive
-access on the index for the duration of the rename() so no one is
-traversing the index during the rename().
-Destroying the index and recreating it opens a large time span during
-which there is no index, and you have to jury-rig something so people
-don't try to use the index. With rename() you just put the new index in
-place with one operation. Just don't let people traverse the index during
-the change. The pointers to the heap tuples are the same in both indexes.
-In fact, with WAL, we will allow multiple physical files for the same
-table by appending the table oid to the file name. In this case, the
-old index could be deleted by rename, and people would continue to use
-the old index until they deleted the open file pointers. Not sure how
-this works in practice because new tuples would not be inserted into the
-old copy of the index.
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
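The rename() approach described above can be illustrated with ordinary files. This is a sketch under stated assumptions, not how the backend manages index storage: rebuild_index_atomically and its build callback are hypothetical names, and the locking needed to keep readers out of the index during the swap is not shown.

```python
import os
import tempfile


def rebuild_index_atomically(index_path, build):
    """Build a replacement file, then swap it in with a single rename.

    `build` is a hypothetical callback that writes the new index to the
    open file it is given. If it fails, the old file is left untouched.
    """
    directory = os.path.dirname(os.path.abspath(index_path))
    # Create the temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".reindex-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            build(tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic on POSIX: the path always names either the complete old
        # file or the complete new one.
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A crash or abort during the build leaves the old index in place, which is the property the message argues for over destroying and recreating the index.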
-From pgman Tue Jan 18 20:04:11 2000
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id VAA11990;
- Tue, 18 Jan 2000 21:04:11 -0500 (EST)
-From: Bruce Momjian <pgman>
-Message-Id: <200001190204.VAA11990@candle.pha.pa.us>
-Subject: Re: [HACKERS] Index recreation in vacuum
-In-Reply-To: <200001190150.UAA11421@candle.pha.pa.us> from Bruce Momjian at "Jan
- 18, 2000 08:50:50 pm"
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-Date: Tue, 18 Jan 2000 21:04:11 -0500 (EST)
-CC: Hiroshi Inoue <Inoue@tpf.co.jp>,
- pgsql-hackers <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Status: RO
-> > I don't know how people estimate this disadvantage.
-> That's why I was recommending rename(). The actual window of
-> vulnerability goes from perhaps hours to fractions of a second.
-> In fact, if I understand this right, you could make the vulnerability
-> zero by just performing the rename as one operation.
-> In fact, for REINDEX cases where you don't have a lock on the entire
-> table as you do in vacuum, you could reindex the table with a simple
-> read-lock on the base table and index, and move the new index into place
-> with the users seeing no change. Only people traversing the index
-> during the change would have a problem. You just need an exclusive
-> access on the index for the duration of the rename() so no one is
-> traversing the index during the rename().
-> Destroying the index and recreating opens a large time span that there
-> is no index, and you have to jury-rig something so people don't try to
-> use the index. With rename() you just put the new index in place with
-> one operation. Just don't let people traverse the index during the
-> change. The pointers to the heap tuples is the same in both indexes.
-> In fact, with WAL, we will allow multiple physical files for the same
-> table by appending the table oid to the file name. In this case, the
-> old index could be deleted by rename, and people would continue to use
-> the old index until they deleted the open file pointers. Not sure how
-> this works in practice because new tuples would not be inserted into the
-> old copy of the index.
-Maybe I am all wrong here. Maybe most of the advantages of rename() are
-meaningless when reindex is used during vacuum, which is the most
-important use of reindex.
-Let's look at index use during vacuum. Right now, how does vacuum
-handle indexes when it moves a tuple? Does it do each index update as
-it moves a tuple? Is that why it is so slow?
-If we don't do that and vacuum fails, what state is the table left in?
-If we don't update the index for every tuple, the index is invalid in a
-vacuum failure. rename() is not going to help us here. It keeps the
-old index around, but the index is invalid anyway, right?
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From Inoue@tpf.co.jp Tue Jan 18 20:18:48 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA12437
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 21:18:46 -0500 (EST)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id LAA02845; Wed, 19 Jan 2000 11:18:18 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>
-Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
-Subject: RE: [HACKERS] Index recreation in vacuum
-Date: Wed, 19 Jan 2000 11:23:55 +0900
-Message-ID: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-Importance: Normal
-In-Reply-To: <200001190204.VAA11990@candle.pha.pa.us>
-Status: ROr
-> -----Original Message-----
-> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
-> > > I don't know how people estimate this disadvantage.
-> >
-> > That's why I was recommending rename(). The actual window of
-> > vulnerability goes from perhaps hours to fractions of a second.
-> >
-> > In fact, if I understand this right, you could make the vulnerability
-> > zero by just performing the rename as one operation.
-> >
-> > In fact, for REINDEX cases where you don't have a lock on the entire
-> > table as you do in vacuum, you could reindex the table with a simple
-> > read-lock on the base table and index, and move the new index into place
-> > with the users seeing no change. Only people traversing the index
-> > during the change would have a problem. You just need an exclusive
-> > access on the index for the duration of the rename() so no one is
-> > traversing the index during the rename().
-> >
-> > Destroying the index and recreating opens a large time span that there
-> > is no index, and you have to jury-rig something so people don't try to
-> > use the index. With rename() you just put the new index in place with
-> > one operation. Just don't let people traverse the index during the
-> > change. The pointers to the heap tuples is the same in both indexes.
-> >
-> > In fact, with WAL, we will allow multiple physical files for the same
-> > table by appending the table oid to the file name. In this case, the
-> > old index could be deleted by rename, and people would continue to use
-> > the old index until they deleted the open file pointers. Not sure how
-> > this works in practice because new tuples would not be inserted into the
-> > old copy of the index.
-> Maybe I am all wrong here. Maybe most of the advantage of rename() are
-> meaningless with reindex using during vacuum, which is the most
-> important use of reindex.
-> Let's look at index using during vacuum. Right now, how does vacuum
-> handle indexes when it moves a tuple? Does it do each index update as
-> it moves a tuple? Is that why it is so slow?
-Yes, I believe so. It's necessary to keep consistency between heap
-table and indexes even in case of abort/crash.
-As far as I can see, it has been a big cost for vacuum.
-Hiroshi Inoue
-From owner-pgsql-hackers@hub.org Tue Jan 18 20:53:49 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA13285
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 21:53:47 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id VAA65183;
- Tue, 18 Jan 2000 21:47:47 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 21:47:33 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id VAA65091
- for pgsql-hackers-outgoing; Tue, 18 Jan 2000 21:46:33 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id VAA65034
- for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 21:46:12 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id VAA13040;
- Tue, 18 Jan 2000 21:45:27 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001190245.VAA13040@candle.pha.pa.us>
-Subject: Re: [HACKERS] Index recreation in vacuum
-In-Reply-To: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp> from Hiroshi Inoue at
- "Jan 19, 2000 11:23:55 am"
-To: Hiroshi Inoue <Inoue@tpf.co.jp>
-Date: Tue, 18 Jan 2000 21:45:27 -0500 (EST)
-CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-> > > In fact, for REINDEX cases where you don't have a lock on the entire
-> > > table as you do in vacuum, you could reindex the table with a simple
-> > > read-lock on the base table and index, and move the new index into place
-> > > with the users seeing no change. Only people traversing the index
-> > > during the change would have a problem. You just need an exclusive
-> > > access on the index for the duration of the rename() so no one is
-> > > traversing the index during the rename().
-> > >
-> > > Destroying the index and recreating opens a large time span that there
-> > > is no index, and you have to jury-rig something so people don't try to
-> > > use the index. With rename() you just put the new index in place with
-> > > one operation. Just don't let people traverse the index during the
-> > > change. The pointers to the heap tuples is the same in both indexes.
-> > >
-> > > In fact, with WAL, we will allow multiple physical files for the same
-> > > table by appending the table oid to the file name. In this case, the
-> > > old index could be deleted by rename, and people would continue to use
-> > > the old index until they deleted the open file pointers. Not sure how
-> > > this works in practice because new tuples would not be inserted into the
-> > > old copy of the index.
-> >
-> > Maybe I am all wrong here. Maybe most of the advantage of rename() are
-> > meaningless with reindex using during vacuum, which is the most
-> > important use of reindex.
-> >
-> > Let's look at index using during vacuum. Right now, how does vacuum
-> > handle indexes when it moves a tuple? Does it do each index update as
-> > it moves a tuple? Is that why it is so slow?
-> >
-> Yes,I believe so. It's necessary to keep consistency between heap
-> table and indexes even in case of abort/crash.
-> As far as I see,it has been a big charge for vacuum.
-OK, how about making a copy of the heap table before starting vacuum,
-moving all the tuples in that copy, creating new indexes, and then moving the
-new heap and indexes over the old version? We already have an exclusive
-lock on the table. That would be 100% reliable, with the disadvantage
-of using 2x the disk space. Seems like a big win.
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From owner-pgsql-hackers@hub.org Tue Jan 18 21:15:24 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA14115
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 22:15:23 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id WAA72950;
- Tue, 18 Jan 2000 22:10:40 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 22:10:32 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id WAA72644
- for pgsql-hackers-outgoing; Tue, 18 Jan 2000 22:09:36 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id WAA72504
- for <pgsql-hackers@postgreSQL.org>; Tue, 18 Jan 2000 22:08:40 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id WAA13965;
- Tue, 18 Jan 2000 22:08:25 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001190308.WAA13965@candle.pha.pa.us>
-Subject: Re: [HACKERS] Index recreation in vacuum
-In-Reply-To: <000f01bf622a$bf423940$2801007e@tpf.co.jp> from Hiroshi Inoue at
- "Jan 19, 2000 12:10:32 pm"
-To: Hiroshi Inoue <Inoue@tpf.co.jp>
-Date: Tue, 18 Jan 2000 22:08:25 -0500 (EST)
-CC: pgsql-hackers <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=UNKNOWN-8BIT
-Content-Transfer-Encoding: 8bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-> I heard from someone that old vacuum had been like so.
-> Probably 2x disk space for big tables was a big disadvantage.
-That's interesting.
-> In addition,rename(),unlink(),mv aren't preferable for transaction
-> control as far as I see. We couldn't avoid inconsistency using
-> those OS functions.
-I disagree. Vacuum can't be rolled back anyway in the sense that you can
-bring back expired tuples, though I have no idea why you would want to.
-You have an exclusive lock on the table. Putting new heap/indexes in
-place that match and have no expired tuples seems like it can not fail
-in any situation.
-Of course, the buffers of the old table have to be marked as invalid,
-but with an exclusive lock, that is not a problem. I am sure we do that
-anyway in vacuum.
-> We have to wait the change of relation file naming if copying
-> vacuum is needed.
-> Under the spec we need not rename(),mv etc.
-Sorry, I don't agree, yet...
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From Inoue@tpf.co.jp Tue Jan 18 21:05:23 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA13858
- for <pgman@candle.pha.pa.us>; Tue, 18 Jan 2000 22:05:21 -0500 (EST)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id MAA02870; Wed, 19 Jan 2000 12:04:55 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>
-Cc: "pgsql-hackers" <pgsql-hackers@postgreSQL.org>
-Subject: RE: [HACKERS] Index recreation in vacuum
-Date: Wed, 19 Jan 2000 12:10:32 +0900
-Message-ID: <000f01bf622a$bf423940$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-Importance: Normal
-In-Reply-To: <200001190245.VAA13040@candle.pha.pa.us>
-Status: ROr
-> -----Original Message-----
-> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
-> > >
-> > > Maybe I am all wrong here. Maybe most of the advantage of
-> rename() are
-> > > meaningless with reindex using during vacuum, which is the most
-> > > important use of reindex.
-> > >
-> > > Let's look at index using during vacuum. Right now, how does vacuum
-> > > handle indexes when it moves a tuple? Does it do each index update as
-> > > it moves a tuple? Is that why it is so slow?
-> > >
-> >
-> > Yes,I believe so. It's necessary to keep consistency between heap
-> > table and indexes even in case of abort/crash.
-> > As far as I see,it has been a big charge for vacuum.
-> OK, how about making a copy of the heap table before starting vacuum,
-> moving all the tuples in that copy, create new index, and then move the
-> new heap and indexes over the old version. We already have an exclusive
-> lock on the table. That would be 100% reliable, with the disadvantage
-> of using 2x the disk space. Seems like a big win.
-I heard from someone that the old vacuum used to work like that.
-Probably 2x disk space for big tables was a big disadvantage.
-In addition, rename(), unlink(), and mv aren't preferable for transaction
-control as far as I can see. We couldn't avoid inconsistency using
-those OS functions.
-We have to wait for the change of relation file naming if copying
-vacuum is needed.
-Under the spec we need not use rename(), mv, etc.
-Hiroshi Inoue
-From dms@wplus.net Wed Jan 19 15:30:40 2000
-Received: from relay.wplus.net (relay.wplus.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA25919
- for <pgman@candle.pha.pa.us>; Wed, 19 Jan 2000 16:30:38 -0500 (EST)
-X-Real-To: pgman@candle.pha.pa.us
-Received: from wplus.net (ppdms.dialup.wplus.net [])
- by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA64218;
- Thu, 20 Jan 2000 00:26:37 +0300 (MSK)
-Message-ID: <38862C9D.C2151E4E@wplus.net>
-Date: Thu, 20 Jan 2000 00:29:01 +0300
-From: Dmitry Samersoff <dms@wplus.net>
-X-Mailer: Mozilla 4.61 [en] (WinNT; I)
-X-Accept-Language: ru,en
-MIME-Version: 1.0
-To: Hiroshi Inoue <Inoue@tpf.co.jp>
-CC: Bruce Momjian <pgman@candle.pha.pa.us>,
- pgsql-hackers <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] Index recreation in vacuum
-References: <000f01bf622a$bf423940$2801007e@tpf.co.jp>
-Content-Type: text/plain; charset=koi8-r
-Content-Transfer-Encoding: 7bit
-Status: ROr
-Hiroshi Inoue wrote:
-> > > Yes,I believe so. It's necessary to keep consistency between heap
-> > > table and indexes even in case of abort/crash.
-> > > As far as I see,it has been a big charge for vacuum.
-> >
-> > OK, how about making a copy of the heap table before starting vacuum,
-> > moving all the tuples in that copy, create new index, and then move the
-> > new heap and indexes over the old version. We already have an exclusive
-> > lock on the table. That would be 100% reliable, with the disadvantage
-> > of using 2x the disk space. Seems like a big win.
-> >
-> I heard from someone that old vacuum had been like so.
-> Probably 2x disk space for big tables was a big disadvantage.
-Yes, it is critical.
-How about a sequence like this:
-* Drop indices (keeping the index descriptions somewhere)
-* Vacuum the table
-* Recreate indices
-If something crashes, the user is notified
-to re-run vacuum or recreate the indices by hand
-when the system restarts.
-I use a script like the one described above for vacuuming;
-it really increases vacuum performance for large tables.
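The drop/vacuum/recreate sequence above can be sketched as follows. The actual script is not shown in the thread, so this is only an illustration: it uses Python's `sqlite3` module as a stand-in for a PostgreSQL session (an assumption made so the example is self-contained), and `vacuum_without_indexes` is a hypothetical helper name. The index definitions are saved first so they can be replayed after the vacuum.

```python
import sqlite3


def vacuum_without_indexes(conn, table):
    """Drop a table's indexes, vacuum, then recreate them (SQLite stand-in).

    `conn` must be in autocommit mode (isolation_level=None), because
    VACUUM cannot run inside an open transaction.
    Returns the names of the indexes that were dropped and recreated.
    """
    # Step 1: keep the index descriptions somewhere. Indexes that SQLite
    # creates implicitly have no stored SQL and cannot be dropped; skip them.
    saved = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()
    for name, _ in saved:
        conn.execute(f'DROP INDEX "{name}"')

    # Step 2: vacuum with no indexes to maintain.
    conn.execute("VACUUM")

    # Step 3: recreate the indexes from the saved definitions.
    for _, ddl in saved:
        conn.execute(ddl)
    return [name for name, _ in saved]
```

If the process dies between the drop and the recreate, the table is left without its indexes, which is the crash window discussed earlier in the thread. A real script would need to store the saved definitions somewhere durable so they can be replayed by hand.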
-Dmitry Samersoff, DM\S
-dms@wplus.net http://devnull.wplus.net
-* there will come soft rains
-From dms@wplus.net Wed Jan 19 15:42:49 2000
-Received: from relay.wplus.net (relay.wplus.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA26645
- for <pgman@candle.pha.pa.us>; Wed, 19 Jan 2000 16:42:47 -0500 (EST)
-X-Real-To: pgman@candle.pha.pa.us
-Received: from wplus.net (ppdms.dialup.wplus.net [])
- by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA65264;
- Thu, 20 Jan 2000 00:39:02 +0300 (MSK)
-Message-ID: <38862F86.20328BD3@wplus.net>
-Date: Thu, 20 Jan 2000 00:41:26 +0300
-From: Dmitry Samersoff <dms@wplus.net>
-X-Mailer: Mozilla 4.61 [en] (WinNT; I)
-X-Accept-Language: ru,en
-MIME-Version: 1.0
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-CC: Hiroshi Inoue <Inoue@tpf.co.jp>,
- pgsql-hackers <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] Index recreation in vacuum
-References: <200001192132.QAA26048@candle.pha.pa.us>
-Content-Type: text/plain; charset=koi8-r
-Content-Transfer-Encoding: 7bit
-Status: ROr
-Bruce Momjian wrote:
-> We need two things:
-> auto-create index on startup
-IMHO, it has to be controlled by the user, because creating a large index
-can take a number of hours. Sometimes it's better to live without one
-altogether, and then build it by hand after the workday ends.
-Dmitry Samersoff, DM\S
-dms@wplus.net http://devnull.wplus.net
-* there will come soft rains
-From owner-pgsql-hackers@hub.org Thu Jan 20 23:51:34 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13891
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 00:51:31 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id AAA91784;
- Fri, 21 Jan 2000 00:47:07 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 00:45:38 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id AAA91495
- for pgsql-hackers-outgoing; Fri, 21 Jan 2000 00:44:40 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id AAA91378
- for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 00:44:04 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id AAA13592;
- Fri, 21 Jan 2000 00:43:49 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001210543.AAA13592@candle.pha.pa.us>
-Subject: [HACKERS] vacuum timings
-To: Tom Lane <tgl@sss.pgh.pa.us>
-Date: Fri, 21 Jan 2000 00:43:49 -0500 (EST)
-CC: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is
-400MB and index is 160MB.
-With index on the single int4 column, I got:
- 78 seconds for a vacuum
- 121 seconds for vacuum after deleting a single row
- 662 seconds for vacuum after deleting the entire table
-With no index, I got:
- 43 seconds for a vacuum
- 43 seconds for vacuum after deleting a single row
- 43 seconds for vacuum after deleting the entire table
-I find this quite interesting.
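A quick subtraction makes the index cost in these numbers explicit. The sketch below is in Python, with the timings copied from the message above; `index_overhead` is a hypothetical helper name. It computes how many seconds of each vacuum are attributable to the index.

```python
# Vacuum timings in seconds, copied from the message above.
WITH_INDEX = {
    "plain vacuum": 78,
    "after deleting a single row": 121,
    "after deleting the entire table": 662,
}
WITHOUT_INDEX = {
    "plain vacuum": 43,
    "after deleting a single row": 43,
    "after deleting the entire table": 43,
}


def index_overhead(with_index, without_index):
    """Return, per case, the extra seconds spent when the index exists."""
    return {case: with_index[case] - without_index[case] for case in with_index}


overhead = index_overhead(WITH_INDEX, WITHOUT_INDEX)
for case, seconds in overhead.items():
    print(f"{case}: {seconds} s attributable to the index")
# plain vacuum: 35 s attributable to the index
# after deleting a single row: 78 s attributable to the index
# after deleting the entire table: 619 s attributable to the index
```

The time without an index stays flat at 43 seconds in all three cases, so in this test the growth in vacuum time comes entirely from index processing.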
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From owner-pgsql-hackers@hub.org Fri Jan 21 00:34:56 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15559
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 01:34:55 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id BAA06108;
- Fri, 21 Jan 2000 01:32:23 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 01:30:38 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id BAA03704
- for pgsql-hackers-outgoing; Fri, 21 Jan 2000 01:27:53 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sunpine.krs.ru (SunPine.krs.ru [])
- by hub.org (8.9.3/8.9.3) with ESMTP id BAA01710
- for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 01:26:44 -0500 (EST)
- (envelope-from vadim@krs.ru)
-Received: from krs.ru (dune.krs.ru [])
- by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id NAA01685;
- Fri, 21 Jan 2000 13:26:33 +0700 (KRS)
-Message-ID: <3887FC19.80305217@krs.ru>
-Date: Fri, 21 Jan 2000 13:26:33 +0700
-From: Vadim Mikheev <vadim@krs.ru>
-Organization: OJSC Rostelecom (Krasnoyarsk)
-X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386)
-X-Accept-Language: ru, en
-MIME-Version: 1.0
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-CC: Tom Lane <tgl@sss.pgh.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] vacuum timings
-References: <200001210543.AAA13592@candle.pha.pa.us>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-Bruce Momjian wrote:
-> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is
-> 400MB and index is 160MB.
-> With index on the single int4 column, I got:
-> 78 seconds for a vacuum
-> 121 seconds for vacuum after deleting a single row
-> 662 seconds for vacuum after deleting the entire table
-> With no index, I got:
-> 43 seconds for a vacuum
-> 43 seconds for vacuum after deleting a single row
-> 43 seconds for vacuum after deleting the entire table
-With or without -F?
-From Inoue@tpf.co.jp Fri Jan 21 00:40:35 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15684
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 01:40:33 -0500 (EST)
-Received: from cadzone ([] (may be forged))
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id PAA04316; Fri, 21 Jan 2000 15:40:35 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Bruce Momjian" <pgman@candle.pha.pa.us>
-Cc: "PostgreSQL-development" <pgsql-hackers@postgreSQL.org>,
- "Tom Lane" <tgl@sss.pgh.pa.us>
-Subject: RE: [HACKERS] vacuum timings
-Date: Fri, 21 Jan 2000 15:46:15 +0900
-Message-ID: <000201bf63db$36cdae20$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-Importance: Normal
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-In-Reply-To: <200001210543.AAA13592@candle.pha.pa.us>
-Status: RO
-> -----Original Message-----
-> From: owner-pgsql-hackers@postgreSQL.org
-> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
-> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is
-> 400MB and index is 160MB.
-> With index on the single int4 column, I got:
-> 78 seconds for a vacuum
- vc_vaconeind() is called once
-> 121 seconds for vacuum after deleting a single row
- vc_vaconeind() is called twice
-Hmmm, vc_vaconeind() takes a pretty long time even when it does little.
-> 662 seconds for vacuum after deleting the entire table
-How about the case where half of the rows are deleted?
-It would take longer.
-Hiroshi Inoue
-From owner-pgsql-hackers@hub.org Fri Jan 21 12:00:49 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13329
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 13:00:47 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id MAA96106;
- Fri, 21 Jan 2000 12:55:34 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 12:53:53 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id MAA95775
- for pgsql-hackers-outgoing; Fri, 21 Jan 2000 12:52:54 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (root@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id MAA95720
- for <pgsql-hackers@postgreSQL.org>; Fri, 21 Jan 2000 12:52:39 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id MAA12106;
- Fri, 21 Jan 2000 12:51:53 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001211751.MAA12106@candle.pha.pa.us>
-Subject: [HACKERS] Re: vacuum timings
-In-Reply-To: <3641.948433911@sss.pgh.pa.us> from Tom Lane at "Jan 21, 2000 00:51:51
- am"
-To: Tom Lane <tgl@sss.pgh.pa.us>
-Date: Fri, 21 Jan 2000 12:51:53 -0500 (EST)
-CC: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is
-> > 400MB and index is 160MB.
-> > With index on the single int4 column, I got:
-> > 78 seconds for a vacuum
-> > 121 seconds for vacuum after deleting a single row
-> > 662 seconds for vacuum after deleting the entire table
-> > With no index, I got:
-> > 43 seconds for a vacuum
-> > 43 seconds for vacuum after deleting a single row
-> > 43 seconds for vacuum after deleting the entire table
-> > I find this quite interesting.
-> How long does it take to create the index on your setup --- ie,
-> if vacuum did a drop/create index, would it be competitive?
-OK, new timings with -F enabled:
-    index   no index
-      519   same       load
-      247   "          first vacuum
-       40   "          other vacuums
-     1222   X          index creation
-       90   X          first vacuum
-       80   X          other vacuums
-       <1   90         delete one row
-      121   38         vacuum after delete 1 row
-      346   344        delete all rows
-      440   44         first vacuum
-       20   <1         other vacuums(index is still same size)
-Conclusions:
- o indexes never get smaller
- o drop/recreate index is slower than vacuum of indexes
-What other conclusions can be made?
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From scrappy@hub.org Fri Jan 21 12:45:38 2000
-Received: from thelab.hub.org (nat200.60.mpoweredpc.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA14380
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 13:45:29 -0500 (EST)
-Received: from localhost (scrappy@localhost)
- by thelab.hub.org (8.9.3/8.9.1) with ESMTP id OAA68289;
- Fri, 21 Jan 2000 14:45:35 -0400 (AST)
- (envelope-from scrappy@hub.org)
-X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
-Date: Fri, 21 Jan 2000 14:45:34 -0400 (AST)
-From: The Hermit Hacker <scrappy@hub.org>
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Tom Lane <tgl@sss.pgh.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Re: vacuum timings
-In-Reply-To: <200001211751.MAA12106@candle.pha.pa.us>
-Message-ID: <Pine.BSF.4.21.0001211443480.23487-100000@thelab.hub.org>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Status: RO
-On Fri, 21 Jan 2000, Bruce Momjian wrote:
-> OK, new timings with -F enabled:
->     index   no index
->       519   same       load
->       247   "          first vacuum
->        40   "          other vacuums
->      1222   X          index creation
->        90   X          first vacuum
->        80   X          other vacuums
->        <1   90         delete one row
->       121   38         vacuum after delete 1 row
->       346   344        delete all rows
->       440   44         first vacuum
->        20   <1         other vacuums(index is still same size)
-> Conclusions:
-> o indexes never get smaller
-This one, I thought, was known? If I remember right, Vadim changed it
-so that space was reused, but the index never shrank in size ... no?
-Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
-Systems Administrator @ hub.org
-primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
-From tgl@sss.pgh.pa.us Fri Jan 21 13:06:35 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA14618
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 14:06:33 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id OAA16501;
- Fri, 21 Jan 2000 14:06:31 -0500 (EST)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: vacuum timings
-In-reply-to: <200001211751.MAA12106@candle.pha.pa.us>
-References: <200001211751.MAA12106@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Fri, 21 Jan 2000 12:51:53 -0500"
-Date: Fri, 21 Jan 2000 14:06:31 -0500
-Message-ID: <16498.948481591@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: RO
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> Conclusions:
-> o indexes never get smaller
-Which we knew...
-> o drop/recreate index is slower than vacuum of indexes
-Quite a few people have reported finding the opposite in practice.
-You should probably try vacuuming after deleting or updating some
-fraction of the rows, rather than just the all or none cases.
- regards, tom lane
-From dms@wplus.net Fri Jan 21 13:51:27 2000
-Received: from relay.wplus.net (relay.wplus.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA15623
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 14:51:24 -0500 (EST)
-X-Real-To: pgman@candle.pha.pa.us
-Received: from wplus.net (ppdms.dialup.wplus.net [])
- by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id WAA89451;
- Fri, 21 Jan 2000 22:46:19 +0300 (MSK)
-Message-ID: <3888B822.28F79A1F@wplus.net>
-Date: Fri, 21 Jan 2000 22:48:50 +0300
-From: Dmitry Samersoff <dms@wplus.net>
-X-Mailer: Mozilla 4.7 [en] (WinNT; I)
-X-Accept-Language: ru,en
-MIME-Version: 1.0
-To: Tom Lane <tgl@sss.pgh.pa.us>
-CC: Bruce Momjian <pgman@candle.pha.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Re: vacuum timings
-References: <200001211751.MAA12106@candle.pha.pa.us> <16498.948481591@sss.pgh.pa.us>
-Content-Type: text/plain; charset=koi8-r
-Content-Transfer-Encoding: 7bit
-Status: ROr
-Tom Lane wrote:
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > Conclusions:
-> > o indexes never get smaller
-> Which we knew...
-> > o drop/recreate index is slower than vacuum of indexes
-> Quite a few people have reported finding the opposite in practice.
-I'm one of them. On a 1.5 GB table with three indices it is about twice
-as slow.
-Probably because vacuuming indices breaks the system cache policy.
-(FreeBSD 3.3)
-Dmitry Samersoff, DM\S
-dms@wplus.net http://devnull.wplus.net
-* there will come soft rains
-From owner-pgsql-hackers@hub.org Fri Jan 21 14:04:08 2000
-Received: from hub.org (hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16140
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 15:04:06 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id OAA34808;
- Fri, 21 Jan 2000 14:59:30 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 14:57:48 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id OAA34320
- for pgsql-hackers-outgoing; Fri, 21 Jan 2000 14:56:50 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [])
- by hub.org (8.9.3/8.9.3) with ESMTP id OAA34255
- for <pgsql-hackers@postgresql.org>; Fri, 21 Jan 2000 14:56:18 -0500 (EST)
- (envelope-from pgman@candle.pha.pa.us)
-Received: (from pgman@localhost)
- by candle.pha.pa.us (8.9.0/8.9.0) id OAA15772;
- Fri, 21 Jan 2000 14:54:22 -0500 (EST)
-From: Bruce Momjian <pgman@candle.pha.pa.us>
-Message-Id: <200001211954.OAA15772@candle.pha.pa.us>
-Subject: Re: [HACKERS] Re: vacuum timings
-In-Reply-To: <3888B822.28F79A1F@wplus.net> from Dmitry Samersoff at "Jan 21,
- 2000 10:48:50 pm"
-To: Dmitry Samersoff <dms@wplus.net>
-Date: Fri, 21 Jan 2000 14:54:21 -0500 (EST)
-CC: Tom Lane <tgl@sss.pgh.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-X-Mailer: ELM [version 2.4ME+ PL66 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgreSQL.org
-Status: RO
-[Charset koi8-r unsupported, filtering to ASCII...]
-> Tom Lane wrote:
-> >
-> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > > Conclusions:
-> > > o indexes never get smaller
-> >
-> > Which we knew...
-> >
-> > > o drop/recreate index is slower than vacuum of indexes
-> >
-> > Quite a few people have reported finding the opposite in practice.
-> I'm one of them. On a 1.5 GB table with three indices it is about twice
-> as slow.
-> Probably because vacuuming indices breaks the system cache policy.
-> (FreeBSD 3.3)
-OK, we are researching what things can be done to improve this. We are
-toying with:
- lock table for less duration, or read lock
- creating another copy of heap/indexes, and rename() over old files
- improving heap vacuum speed
- improving index vacuum speed
- moving analyze out of vacuum
- Bruce Momjian | http://www.op.net/~candle
- pgman@candle.pha.pa.us | (610) 853-3000
- + If your life is a hard drive, | 830 Blythe Avenue
- + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
-From scrappy@hub.org Fri Jan 21 14:12:16 2000
-Received: from thelab.hub.org (nat200.60.mpoweredpc.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16521
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 15:12:13 -0500 (EST)
-Received: from localhost (scrappy@localhost)
- by thelab.hub.org (8.9.3/8.9.1) with ESMTP id QAA69039;
- Fri, 21 Jan 2000 16:12:25 -0400 (AST)
- (envelope-from scrappy@hub.org)
-X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs
-Date: Fri, 21 Jan 2000 16:12:25 -0400 (AST)
-From: The Hermit Hacker <scrappy@hub.org>
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: Dmitry Samersoff <dms@wplus.net>, Tom Lane <tgl@sss.pgh.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgresql.org>
-Subject: Re: [HACKERS] Re: vacuum timings
-In-Reply-To: <200001211954.OAA15772@candle.pha.pa.us>
-Message-ID: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Status: RO
-On Fri, 21 Jan 2000, Bruce Momjian wrote:
-> [Charset koi8-r unsupported, filtering to ASCII...]
-> > Tom Lane wrote:
-> > >
-> > > Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > > > Conclusions:
-> > > > o indexes never get smaller
-> > >
-> > > Which we knew...
-> > >
-> > > > o drop/recreate index is slower than vacuum of indexes
-> > >
-> > > Quite a few people have reported finding the opposite in practice.
-> >
-> > I'm one of them. On a 1.5 GB table with three indices it is about twice
-> > as slow.
-> > Probably because vacuuming indices breaks the system cache policy.
-> > (FreeBSD 3.3)
-> OK, we are researching what things can be done to improve this. We are
-> toying with:
-> lock table for less duration, or read lock
-if there is some way that we can work around the bug that I believe Tom
-found with removing the lock altogether (i.e. making use of MVCC), I think
-that would be the best option ... if not possible, at least get things
-down to a table lock vs the whole database?
-a good example is the udmsearch that we are using on the site ... it uses
-multiple tables to store the dictionary, each representing words of X size
-... if I'm searching on a 4 letter word, and the whole database is locked
-while it is working on the dictionary with 8 letter words, I'm sitting
-there idle ... at least if we only locked the 8 letter table, everyone not
-doing 8 letter searches can go on their merry way ...
-Slightly longer vacuums, IMHO, are acceptable if, to the end users, it's
-as transparent as possible ... locking per table would be slightly slower,
-I think, because once a table is finished, the next table would need to
-have an exclusive lock put on it before starting, so you'd have to
-possibly wait for that...?
-> creating another copy of heap/indexes, and rename() over old files
-sounds to me like introducing a large potential for error here ...
-> moving analyze out of vacuum
-I think that should be done anyway ... if we ever get to the point that
-we're able to re-use rows in tables, then that would eliminate the
-immediate requirement for vacuum, but still retain a requirement for a
-periodic analyze ... no?
-Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy
-Systems Administrator @ hub.org
-primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
-From tgl@sss.pgh.pa.us Fri Jan 21 16:02:07 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA20290
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 17:02:06 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id RAA09697;
- Fri, 21 Jan 2000 17:02:06 -0500 (EST)
-To: The Hermit Hacker <scrappy@hub.org>
-cc: Bruce Momjian <pgman@candle.pha.pa.us>,
- PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] Re: vacuum timings
-In-reply-to: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
-References: <Pine.BSF.4.21.0001211607080.23487-100000@thelab.hub.org>
-Comments: In-reply-to The Hermit Hacker <scrappy@hub.org>
- message dated "Fri, 21 Jan 2000 16:12:25 -0400"
-Date: Fri, 21 Jan 2000 17:02:06 -0500
-Message-ID: <9694.948492126@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: RO
-The Hermit Hacker <scrappy@hub.org> writes:
->> lock table for less duration, or read lock
-> if there is some way that we can work around the bug that I believe Tom
-> found with removing the lock altogether (i.e. making use of MVCC), I think
-> that would be the best option ... if not possible, at least get things
-> down to a table lock vs the whole database?
-Huh? VACUUM only requires an exclusive lock on the table it is
-currently vacuuming; there's no database-wide lock.
-Even a single-table exclusive lock is bad, of course, if it's a large
-table that's critical to a 24x7 application. Bruce was talking about
-the possibility of having VACUUM get just a write lock on the table;
-other backends could still read it, but not write it, during the vacuum
-process. That'd be a considerable step forward for 24x7 applications,
-I think.
-It looks like that could be done if we rewrote the table as a new file
-(instead of compacting-in-place), but there's a problem when it comes
-time to rename the new files into place. At that point you'd need to
-get an exclusive lock to ensure all the readers are out of the table too
---- and upgrading from a plain lock to an exclusive lock is a well-known
-recipe for deadlocks. Not sure if this can be solved.
- regards, tom lane
-From tgl@sss.pgh.pa.us Fri Jan 21 22:50:34 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA01657
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 23:50:28 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA19681;
- Fri, 21 Jan 2000 23:50:13 -0500 (EST)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: vacuum timings
-In-reply-to: <200001211751.MAA12106@candle.pha.pa.us>
-References: <200001211751.MAA12106@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Fri, 21 Jan 2000 12:51:53 -0500"
-Date: Fri, 21 Jan 2000 23:50:13 -0500
-Message-ID: <19678.948516613@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: ROr
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> Conclusions:
-> o drop/recreate index is slower than vacuum of indexes
-BTW, I did some profiling of CREATE INDEX this evening (quite
-unintentionally actually; I was interested in COPY IN, but the pg_dump
-script I used as driver happened to create some indexes too). I was
-startled to discover that 60% of the runtime of CREATE INDEX is spent in
-_bt_invokestrat (which is called from tuplesort.c's comparetup_index,
-and exists only to figure out which specific comparison routine to call).
-Of this, a whopping 4% was spent in the useful subroutine, int4gt. All
-the rest went into lookup and validation checks that by rights should be
-done once per index creation, not once per comparison.
-In short: a fairly straightforward bit of optimization will eliminate
-circa 50% of the CPU time consumed by CREATE INDEX. All we need is to
-figure out where to cache the lookup results. The optimization would
-improve insertions and lookups in indexes, as well, if we can cache
-the lookup results in those scenarios.
-This was for a table small enough that tuplesort.c could do the sort
-entirely in memory, so I'm sure the gains would be smaller for a large
-table that requires a disk-based sort. Still, it seems worth looking into.
- regards, tom lane
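The optimization described in this message (resolve the comparison routine once per index build rather than once per comparison) can be illustrated with a minimal sketch. This is not PostgreSQL code: `ComparatorCatalog`, `sort_lookup_per_comparison`, and `sort_lookup_once` are hypothetical names, and the dictionary lookup merely stands in for the lookup and validation work attributed to _bt_invokestrat.

```python
from functools import cmp_to_key


class ComparatorCatalog:
    """Hypothetical stand-in for the lookup of a type's comparison routine."""

    def __init__(self):
        self.lookups = 0  # counts how often the lookup is performed
        self._routines = {"int4": lambda a, b: (a > b) - (a < b)}

    def lookup(self, type_name):
        self.lookups += 1
        return self._routines[type_name]


def sort_lookup_per_comparison(values, type_name, catalog):
    # Resolves the routine again on every comparison (the costly pattern).
    def compare(a, b):
        return catalog.lookup(type_name)(a, b)

    return sorted(values, key=cmp_to_key(compare))


def sort_lookup_once(values, type_name, catalog):
    # Resolves the routine a single time and reuses it for the whole sort.
    compare = catalog.lookup(type_name)
    return sorted(values, key=cmp_to_key(compare))
```

Both functions return the same ordering; only the number of lookups differs, which is the cost the message proposes to cache away.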
-From owner-pgsql-hackers@hub.org Sat Jan 22 02:31:03 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA06743
- for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:31:02 -0500 (EST)
-Received: from hub.org (hub.org []) by renoir.op.net (o1/$Revision: 1.7 $) with ESMTP id DAA07529 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 03:25:13 -0500 (EST)
-Received: from localhost (majordom@localhost)
- by hub.org (8.9.3/8.9.3) with SMTP id DAA31900;
- Sat, 22 Jan 2000 03:19:53 -0500 (EST)
- (envelope-from owner-pgsql-hackers)
-Received: by hub.org (bulk_mailer v1.5); Sat, 22 Jan 2000 03:17:56 -0500
-Received: (from majordom@localhost)
- by hub.org (8.9.3/8.9.3) id DAA31715
- for pgsql-hackers-outgoing; Sat, 22 Jan 2000 03:16:58 -0500 (EST)
- (envelope-from owner-pgsql-hackers@postgreSQL.org)
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [])
- by hub.org (8.9.3/8.9.3) with ESMTP id DAA31647
- for <pgsql-hackers@postgresql.org>; Sat, 22 Jan 2000 03:16:26 -0500 (EST)
- (envelope-from Inoue@tpf.co.jp)
-Received: from mcadnote1 (ppm114.noc.fukui.nsk.ne.jp [])
- by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
- id RAA04754; Sat, 22 Jan 2000 17:14:43 +0900
-From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-To: "Tom Lane" <tgl@sss.pgh.pa.us>, "Bruce Momjian" <pgman@candle.pha.pa.us>
-Cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
-Subject: RE: [HACKERS] Re: vacuum timings
-Date: Sat, 22 Jan 2000 17:15:37 +0900
-MIME-Version: 1.0
-Content-Type: text/plain;
- charset="iso-2022-jp"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-In-Reply-To: <16498.948481591@sss.pgh.pa.us>
-Importance: Normal
-Sender: owner-pgsql-hackers@postgresql.org
-Status: RO
-> -----Original Message-----
-> From: owner-pgsql-hackers@postgresql.org
-> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tom Lane
-> Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> > Conclusions:
-> > o indexes never get smaller
-> Which we knew...
-> > o drop/recreate index is slower than vacuum of indexes
-> Quite a few people have reported finding the opposite in practice.
-> You should probably try vacuuming after deleting or updating some
-> fraction of the rows, rather than just the all or none cases.
-Vacuum after deleting all rows isn't a worst case.
-There's no moving in that case and vacuum doesn't need to call
-index_insert() corresponding to the moving of heap tuples.
-Vacuum after deleting half of the rows may be one of the worst cases.
-In this case, index_delete() is called as many times as in the 'delete all'
-case and the expensive index_insert() is called for moved_in tuples.
-Hiroshi Inoue
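A toy model may make the argument above concrete. It assumes (this is a reading of the message, not something taken from the vacuum source) that vacuum compacts the heap by moving live tuples from the tail into free slots at the head, that every dead tuple and every moved tuple costs one index_delete(), and that every moved tuple also costs one index_insert(). The function name `vacuum_cost` is hypothetical.

```python
def vacuum_cost(live):
    """Estimate index work for compacting a heap.

    live: list of bools, one per heap slot (True = live tuple).
    Returns (index_deletes, index_inserts) under the assumptions above.
    """
    n_live = sum(live)
    dead = len(live) - n_live
    # After compaction the live tuples occupy the first n_live slots, so
    # every live tuple currently beyond that point has to be moved.
    moved = sum(1 for is_live in live[n_live:] if is_live)
    return dead + moved, moved
```

Under this model an eight-slot table with every row deleted needs 8 deletes and 0 inserts, while the same table with only its first half deleted needs 8 deletes plus 4 inserts.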
-From tgl@sss.pgh.pa.us Sat Jan 22 10:31:02 2000
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA20882
- for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:31:00 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us []) by renoir.op.net (o1/$Revision: 1.7 $) with ESMTP id LAA26612 for <pgman@candle.pha.pa.us>; Sat, 22 Jan 2000 11:12:44 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id LAA20569;
- Sat, 22 Jan 2000 11:11:26 -0500 (EST)
-To: "Hiroshi Inoue" <Inoue@tpf.co.jp>
-cc: "Bruce Momjian" <pgman@candle.pha.pa.us>,
- "PostgreSQL-development" <pgsql-hackers@postgreSQL.org>
-Subject: Re: [HACKERS] Re: vacuum timings
-In-reply-to: <NDBBIJLOILGIKBGDINDFIEEACCAA.Inoue@tpf.co.jp>
-Comments: In-reply-to "Hiroshi Inoue" <Inoue@tpf.co.jp>
- message dated "Sat, 22 Jan 2000 17:15:37 +0900"
-Date: Sat, 22 Jan 2000 11:11:25 -0500
-Message-ID: <20566.948557485@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: RO
-"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
-> Vacuum after deleting half of the rows may be one of the worst cases.
-Or equivalently, vacuum after updating all the rows.
- regards, tom lane
-From tgl@sss.pgh.pa.us Thu Jan 20 23:51:49 2000
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13919
- for <pgman@candle.pha.pa.us>; Fri, 21 Jan 2000 00:51:47 -0500 (EST)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id AAA03644;
- Fri, 21 Jan 2000 00:51:51 -0500 (EST)
-To: Bruce Momjian <pgman@candle.pha.pa.us>
-cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
-Subject: Re: vacuum timings
-In-reply-to: <200001210543.AAA13592@candle.pha.pa.us>
-References: <200001210543.AAA13592@candle.pha.pa.us>
-Comments: In-reply-to Bruce Momjian <pgman@candle.pha.pa.us>
- message dated "Fri, 21 Jan 2000 00:43:49 -0500"
-Date: Fri, 21 Jan 2000 00:51:51 -0500
-Message-ID: <3641.948433911@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Status: ROr
-Bruce Momjian <pgman@candle.pha.pa.us> writes:
-> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is
-> 400MB and index is 160MB.
-> With index on the single int4 column, I got:
-> 78 seconds for a vacuum
-> 121 seconds for vacuum after deleting a single row
-> 662 seconds for vacuum after deleting the entire table
-> With no index, I got:
-> 43 seconds for a vacuum
-> 43 seconds for vacuum after deleting a single row
-> 43 seconds for vacuum after deleting the entire table
-> I find this quite interesting.
-How long does it take to create the index on your setup --- ie,
-if vacuum did a drop/create index, would it be competitive?
- regards, tom lane
-From pgsql-hackers-owner+M5909@hub.org Thu Aug 17 20:15:33 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00644
- for <pgman@candle.pha.pa.us>; Thu, 17 Aug 2000 20:15:32 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7I0APm69660;
- Thu, 17 Aug 2000 20:10:25 -0400 (EDT)
-Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7I01Jm68072
- for <pgsql-hackers@postgresql.org>; Thu, 17 Aug 2000 20:01:19 -0400 (EDT)
-Received: (from bright@localhost)
- by fw.wintelcom.net (8.10.0/8.10.0) id e7I01IA20820
- for pgsql-hackers@postgresql.org; Thu, 17 Aug 2000 17:01:18 -0700 (PDT)
-Date: Thu, 17 Aug 2000 17:01:18 -0700
-From: Alfred Perlstein <bright@wintelcom.net>
-To: pgsql-hackers@postgresql.org
-Subject: [HACKERS] VACUUM optimization ideas.
-Message-ID: <20000817170118.K4854@fw.wintelcom.net>
-Mime-Version: 1.0
-Content-Type: text/plain; charset=us-ascii
-Content-Disposition: inline
-User-Agent: Mutt/1.2.4i
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ROr
-Here are two ideas I had for optimizing vacuum. I apologize in advance
-if the ideas presented here are naive and don't take into account
-the actual code that makes up postgresql.
-Reducing the time vacuum must hold an exclusive lock on a table:
-The idea is that since rows are marked deleted it's ok for the
-vacuum to fill them with data from the tail of the table as
-long as no transaction is in progress that has started before
-the row was deleted.
-This may allow the vacuum process to copy back all the data without
-a lock; when all the copying is done it then acquires an exclusive lock
-and does this:
-Acquire an exclusive lock.
-Walk all the deleted data marking it as current.
-Truncate the table.
-Release the lock.
-Since the data is still marked invalid (right?) even if valid data
-is copied into the space it should be ignored as long as there's no
-transaction occurring that started before the data was invalidated.
-Reducing the amount of scanning a vacuum must do:
-It would make sense that if a value of the earliest deleted chunk
-was kept in a table then vacuum would not have to scan the entire
-table in order to work, it would only need to start at the 'earliest'
-invalidated row.
-The utility of this (at least for us) is that we have several tables
-that will grow to hundreds of megabytes, however changes will only
-happen at the tail end (recently added rows). If we could reduce the
-amount of time spent in a vacuum state it would help us a lot.
-I'm wondering if these ideas make sense and may help at all.
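The second idea (remember the earliest invalidated row so vacuum can skip everything before it) might look roughly like the sketch below. It is a single-process toy: `TailHeap` and its `earliest_deleted` bookmark are hypothetical, and the sketch ignores locking and transaction visibility entirely.

```python
class TailHeap:
    """Toy heap that remembers the earliest slot holding a deleted row."""

    def __init__(self):
        self.rows = []                # each entry is [value, is_deleted]
        self.earliest_deleted = None  # hypothetical per-table bookmark

    def insert(self, value):
        self.rows.append([value, False])

    def delete(self, position):
        self.rows[position][1] = True
        if self.earliest_deleted is None or position < self.earliest_deleted:
            self.earliest_deleted = position

    def vacuum(self):
        """Compact the table and return how many slots had to be scanned."""
        if self.earliest_deleted is None:
            return 0
        start = self.earliest_deleted
        scanned = len(self.rows) - start
        # Only the region from the bookmark onward is examined and rewritten.
        self.rows[start:] = [row for row in self.rows[start:] if not row[1]]
        self.earliest_deleted = None
        return scanned
```

For a table whose changes cluster at the tail, the scanned region stays small no matter how large the untouched head of the table grows.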
--Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
-From pgsql-hackers-owner+M5912@hub.org Fri Aug 18 01:36:14 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA07787
- for <pgman@candle.pha.pa.us>; Fri, 18 Aug 2000 01:36:12 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7I5Q2m38759;
- Fri, 18 Aug 2000 01:26:04 -0400 (EDT)
-Received: from courier02.adinet.com.uy (courier02.adinet.com.uy [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7I5Bam35785
- for <pgsql-hackers@postgresql.org>; Fri, 18 Aug 2000 01:11:37 -0400 (EDT)
-Received: from adinet.com.uy (haroldo@r207-50-240-116.adinet.com.uy [])
- by courier02.adinet.com.uy (8.9.3/8.9.3) with ESMTP id CAA17259;
- Fri, 18 Aug 2000 02:10:49 -0300 (GMT)
-Message-ID: <399CC739.B9B13D18@adinet.com.uy>
-Date: Fri, 18 Aug 2000 02:18:49 -0300
-From: hstenger@adinet.com.uy
-Reply-To: hstenger@ieee.org
-Organization: PRISMA, Servicio y Desarrollo
-X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.14 i586)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: Alfred Perlstein <bright@wintelcom.net>, pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] VACUUM optimization ideas.
-References: <20000817170118.K4854@fw.wintelcom.net>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: ROr
-Alfred Perlstein wrote:
-> #1
-> Reducing the time vacuum must hold an exclusive lock on a table:
-> The idea is that since rows are marked deleted it's ok for the
-> vacuum to fill them with data from the tail of the table as
-> long as no transaction is in progress that has started before
-> the row was deleted.
-> This may allow the vacuum process to copy back all the data without
-> a lock; when all the copying is done it then acquires an exclusive lock
-> and does this:
-> Acquire an exclusive lock.
-> Walk all the deleted data marking it as current.
-> Truncate the table.
-> Release the lock.
-> Since the data is still marked invalid (right?) even if valid data
-> is copied into the space it should be ignored as long as there's no
-> transaction occurring that started before the data was invalidated.
-Yes, but nothing prevents newer transactions from modifying the _origin_ side of
-the copied data _after_ it was copied, but before the Lock-Walk-Truncate-Unlock
-cycle takes place, and so it seems unsafe. Maybe locking each record before
-copying it up ...
- Haroldo Stenger | hstenger@ieee.org
- Montevideo, Uruguay. | hstenger@adinet.com.uy
- Visit UYLUG Web Site: http://www.linux.org.uy
-From pgsql-hackers-owner+M5917@hub.org Fri Aug 18 09:41:33 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA05170
- for <pgman@candle.pha.pa.us>; Fri, 18 Aug 2000 09:41:33 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7IDVjm75143;
- Fri, 18 Aug 2000 09:31:46 -0400 (EDT)
-Received: from andie.ip23.net (andie.ip23.net [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7IDPIm73296
- for <pgsql-hackers@postgresql.org>; Fri, 18 Aug 2000 09:25:18 -0400 (EDT)
-Received: from imap1.ip23.net (imap1.ip23.net [])
- by andie.ip23.net (8.9.3/8.9.3) with ESMTP id PAA58387;
- Fri, 18 Aug 2000 15:25:12 +0200 (CEST)
-Received: from ip23.net (spc.ip23.net [])
- by imap1.ip23.net (8.9.3/8.9.3) with ESMTP id PAA59177;
- Fri, 18 Aug 2000 15:41:28 +0200 (CEST)
-Message-ID: <399D3938.582FDB49@ip23.net>
-Date: Fri, 18 Aug 2000 15:25:12 +0200
-From: Sevo Stille <sevo@ip23.net>
-Organization: IP23
-X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.10 i686)
-X-Accept-Language: en, de
-MIME-Version: 1.0
-To: Alfred Perlstein <bright@wintelcom.net>
-CC: pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] VACUUM optimization ideas.
-References: <20000817170118.K4854@fw.wintelcom.net>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: RO
-Alfred Perlstein wrote:
-> The idea is that since rows are marked deleted it's ok for the
-> vacuum to fill them with data from the tail of the table as
-> long as no transaction is in progress that has started before
-> the row was deleted.
-Well, isn't one of the advantages of vacuuming in the reordering it
-does? With a "fill deleted chunks" logic, we'd have far less order in
-the databases.
-> This may allow the vacuum process to copy back all the data without
-> a lock,
-Nope. Another process might update the values in between move and mark,
-if the record is not locked. We'd either have to write-lock the entire
-table for that period, write lock every item as it is moved, or lock,
-move and mark on a per-record base. The latter would be slow, but it
-could be done in a permanent low priority background process, utilizing
-empty CPU cycles. Besides, it probably could not only be done simply
-filling from the tail, but also moving up the records in a sorted
-> #2
-> Reducing the amount of scanning a vacuum must do:
-> It would make sense that if a value of the earliest deleted chunk
-> was kept in a table then vacuum would not have to scan the entire
-> table in order to work, it would only need to start at the 'earliest'
-> invalidated row.
-Trivial to do. But of course #1 may imply that the physical ordering is
-even less likely to be related to the logical ordering in a way where
-this helps.
-> The utility of this (at least for us) is that we have several tables
-> that will grow to hundreds of megabytes, however changes will only
-> happen at the tail end (recently added rows).
-The tail is a relative position - except for the case where you add
-temporary records to a constant default set, everything in the tail will
-move, at least relatively, to the head after some time.
-> If we could reduce the
-> amount of time spent in a vacuum state it would help us a lot.
-Rather: If we can reduce the time spent in a locked state while
-vacuuming, it would help a lot. Being in a vacuum is not the issue -
-even permanent vacuuming need not be an issue, if the locks it uses are
-suitably short-time.
-From pgsql-hackers-owner+M5911@hub.org Thu Aug 17 21:11:20 2000
-Received: from hub.org (root@hub.org [])
- by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA01882
- for <pgman@candle.pha.pa.us>; Thu, 17 Aug 2000 21:11:20 -0400 (EDT)
-Received: from hub.org (majordom@localhost [])
- by hub.org (8.10.1/8.10.1) with SMTP id e7I119m80626;
- Thu, 17 Aug 2000 21:01:09 -0400 (EDT)
-Received: from acheron.rime.com.au (root@albatr.lnk.telstra.net [])
- by hub.org (8.10.1/8.10.1) with ESMTP id e7I0wMm79870
- for <pgsql-hackers@postgresql.org>; Thu, 17 Aug 2000 20:58:22 -0400 (EDT)
-Received: from oberon (Oberon.rime.com.au [])
- by acheron.rime.com.au (8.9.3/8.9.3) with SMTP id KAA03215;
- Fri, 18 Aug 2000 10:58:25 +1000
-Message-Id: <>
-X-Sender: pjw@mail.rhyme.com.au
-X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
-Date: Fri, 18 Aug 2000 10:58:35 +1000
-To: Chris Bitmead <chrisb@nimrod.itg.telstra.com.au>,
- Ben Adida <ben@openforce.net>
-From: Philip Warner <pjw@rhyme.com.au>
-Subject: Re: [HACKERS] Inserting a select statement result into another
- table
-Cc: Andrew Selle <aselle@upl.cs.wisc.edu>, pgsql-hackers@postgresql.org
-In-Reply-To: <399C7689.2DDDAD1D@nimrod.itg.telecom.com.au>
-References: <20000817130517.A10909@upl.cs.wisc.edu>
- <399BF555.43FB70C8@openforce.net>
-Mime-Version: 1.0
-Content-Type: text/plain; charset="us-ascii"
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: O
-At 09:34 18/08/00 +1000, Chris Bitmead wrote:
->He does ask a legitimate question though. If you are going to have a
->LIMIT feature (which of course is not pure SQL), there seems no reason
->you shouldn't be able to insert the result into a table.
-This feature is supported by two commercial DBs: Dec/RDB and SQL/Server. I
-have no idea if Oracle supports it, but it is such a *useful* feature that
-I would be very surprised if it didn't.
->Ben Adida wrote:
->> What is the purpose you're trying to accomplish with this order by? No
-matter what, all the
->> rows where done='f' will be inserted, and you will not be left with any
-indication of that
->> order once the rows are in the todolist table.
-I don't know what his *purpose* was, but the query should only insert the
-first two rows from the select (because of the limit).
->> Andrew Selle wrote:
->> > Alright. My situation is this. I have a list of things that need to be done
->> > in a table called tasks. I have a list of users who will complete these tasks.
->> > I want these users to be able to come in and "claim" the top 2 most recent tasks
->> > that have been added. These tasks then get stored in a table called
->> > which stores who claimed the task, the taskid, and when the task was
->> > For each time someone wants to claim some number of tasks, I want to do something
->> > like
->> >
->> > INSERT INTO todolist
->> > SELECT taskid,'1',now()
->> > FROM tasks
->> > WHERE done='f'
->> > ORDER BY submit DESC
->> > LIMIT 2;
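The behaviour the quoted query is meant to have (insert only the two most recent unfinished tasks) can be tried out with a small sketch. It uses Python's sqlite3 module rather than PostgreSQL, purely because SQLite accepts ORDER BY and LIMIT inside INSERT ... SELECT; the column names of todolist and the integer submit values are assumptions made for the demonstration.

```python
import sqlite3


def claim_tasks(conn, user_id, count):
    """Copy the `count` most recently submitted unfinished tasks into todolist."""
    conn.execute(
        "INSERT INTO todolist (taskid, userid, claimed) "
        "SELECT taskid, ?, datetime('now') FROM tasks "
        "WHERE done = 'f' ORDER BY submit DESC LIMIT ?",
        (user_id, count),
    )
    rows = conn.execute("SELECT taskid FROM todolist ORDER BY taskid")
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (taskid INTEGER PRIMARY KEY, submit INTEGER, done TEXT)"
    )
    conn.execute("CREATE TABLE todolist (taskid INTEGER, userid TEXT, claimed TEXT)")
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?)",
        [(1, 10, "f"), (2, 20, "f"), (3, 30, "t"), (4, 40, "f")],
    )
    # Task 3 is already done, so the two most recent open tasks are 4 and 2.
    return claim_tasks(conn, "1", 2)
```

Calling `demo()` returns the claimed task ids in ascending order, so only two rows reach todolist even though three tasks are still open.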
-Philip Warner | __---_____
-Albatross Consulting Pty. Ltd. |----/ - \
-(A.B.N. 75 008 659 498) | /(@) ______---_
-Tel: (+61) 0500 83 82 81 | _________ \
-Fax: (+61) 0500 83 82 82 | ___________ |
-Http://www.rhyme.com.au | / \|
- | --________--
-PGP key available upon request, | /
-and from pgp5.ai.mit.edu:11371 |/
-From pgsql-hackers-owner+M29308@postgresql.org Mon Sep 23 09:47:54 2002
-Return-path: <pgsql-hackers-owner+M29308@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8NDlqd00289
- for <pgman@candle.pha.pa.us>; Mon, 23 Sep 2002 09:47:53 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id 7CA64476497; Mon, 23 Sep 2002 09:43:28 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id EDA70475BC3; Mon, 23 Sep 2002 09:43:20 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 85264476479
- for <pgsql-hackers@postgresql.org>; Mon, 23 Sep 2002 09:43:15 -0400 (EDT)
-Received: from www.pspl.co.in (www.pspl.co.in [])
- by postgresql.org (Postfix) with ESMTP id C7899476477
- for <pgsql-hackers@postgresql.org>; Mon, 23 Sep 2002 09:43:12 -0400 (EDT)
-Received: (from root@localhost)
- by www.pspl.co.in (8.11.6/8.11.6) id g8NDiQ030526
- for <pgsql-hackers@postgresql.org>; Mon, 23 Sep 2002 19:14:26 +0530
-Received: from daithan (daithan.intranet.pspl.co.in [])
- by www.pspl.co.in (8.11.6/8.11.0) with ESMTP id g8NDiQ330521;
- Mon, 23 Sep 2002 19:14:26 +0530
-From: "Shridhar Daithankar" <shridhar_daithankar@persistent.co.in>
-To: pgsql-hackers@postgresql.org, pgsql-general@postgresql.org
-Date: Mon, 23 Sep 2002 19:13:44 +0530
-MIME-Version: 1.0
-Subject: [HACKERS] Postgresql Automatic vacuum
-Reply-To: shridhar_daithankar@persistent.co.in
-Message-ID: <3D8F67E8.7500.4E0E180@localhost>
-X-Mailer: Pegasus Mail for Windows (v4.02)
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7BIT
-Content-Description: Mail message body
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-X-Virus-Scanned: by AMaViS new-20020517
-Status: OR
-Hello All,
-I have written a small daemon that can automatically vacuum a PostgreSQL
-database, depending on per-table activity.
-It sits on top of the postgres statistics collector. The postgres installation
-should have per-row statistics collection enabled.
-Features are:
-* Vacuuming based on activity on the table
-* Per table vacuum. So only heavily updated tables are vacuumed.
-* multiple databases supported
-* Performs 'vacuum analyze' only, so it will not block the database
-The project location is
-Let me know about bugs, improvements, and comments.
-I am sure real-world postgres installations have some sort of scripts doing
-a similar thing. This is an attempt to provide a generic interface to periodic
- Shridhar
-The Abrams' Principle: The shortest distance between two points is off the
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
-From pgsql-hackers-owner+M29344@postgresql.org Tue Sep 24 02:42:36 2002
-Return-path: <pgsql-hackers-owner+M29344@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8O6gYg19416
- for <pgman@candle.pha.pa.us>; Tue, 24 Sep 2002 02:42:35 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id 128704762AF; Tue, 24 Sep 2002 02:42:36 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id DE80C4760F5; Tue, 24 Sep 2002 02:42:32 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 40A8A475DBC
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 02:42:25 -0400 (EDT)
-Received: from relay.icomedias.com (relay.icomedias.com [])
- by postgresql.org (Postfix) with ESMTP id 7ECC8475DAD
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 02:42:23 -0400 (EDT)
-Received: from loki ([])
- by relay.icomedias.com (8.12.5/8.12.5) with ESMTP id g8O6g8BX014226;
- Tue, 24 Sep 2002 08:42:09 +0200
-Content-Type: text/plain;
- charset="iso-8859-1"
-From: Mario Weilguni <mweilguni@sime.com>
-To: shridhar_daithankar@persistent.co.in, matthew@zeut.net
-Subject: Re: [HACKERS] Postgresql Automatic vacuum
-Date: Tue, 24 Sep 2002 08:42:06 +0200
-User-Agent: KMail/1.4.3
-cc: pgsql-hackers@postgresql.org
-References: <3D8F67E8.7500.4E0E180@localhost> <3D9050B2.9782.86E55C0@localhost>
-In-Reply-To: <3D9050B2.9782.86E55C0@localhost>
-MIME-Version: 1.0
-Message-ID: <200209240842.06459.mweilguni@sime.com>
-avpresult: 0, ok, ok
-X-Scanned-By: MIMEDefang 2.16 (www . roaringpenguin . com / mimedefang)
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-X-Virus-Scanned: by AMaViS new-20020517
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id g8O6gYg19416
-Status: OR
-On Tuesday, 24 September 2002 08:16, Shridhar Daithankar wrote:
-> > I will play with it more and give you some more feedback.
-> Awaiting that.
-IMO there are still several problems with that approach, namely:
-* every database will get "polluted" with the autovacuum table, which is undesired
-* the biggest problem is the ~/.pgavrc file. I think it should work like other postgres utils do, e.g. supporting -U, -d, ....
-* it's not possible to use it without actively administering the config file. It should be able to work without
- administrator assistance.
-When this is a daemon, why not store the data in memory? Even with several thousands of tables the memory footprint would
- still be small. And it should be possible to use for all databases without modifying a config file.
-Two weeks ago I began writing a similar daemon, but had no time yet to finish it. I've tried to avoid using fixed numbers (namely "vacuum table
-after 1000 updates") and tried to make my own heuristic based on the statistics data and the size of the table. The reason is, for a large table 1000 entries might be
-a small percentage and vacuum is not necessary, while for small tables 10 updates might be sufficient.
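A minimal C sketch of the size-relative heuristic described above. The function name `needs_vacuum` and the knobs `base_threshold` and `scale_factor` are hypothetical and come from neither daemon discussed in this thread; the sketch only shows why a limit that grows with table size treats 1000 changes differently on a large table and a small one.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether a table is due for a vacuum.
 * live_rows    - estimated number of rows in the table
 * changed_rows - rows updated or deleted since the last vacuum
 * The limit is a fixed floor plus a fraction of the table size, so it
 * scales with the table instead of being one number for every table.
 */
bool needs_vacuum(double live_rows, double changed_rows,
                  double base_threshold, double scale_factor)
{
    assert(live_rows >= 0 && changed_rows >= 0);
    double limit = base_threshold + scale_factor * live_rows;
    return changed_rows > limit;
}
```

With a floor of 10 and a factor of 0.1, 1000 changes on a million-row table stay below the limit, while 20 changes on a 50-row table exceed it.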
-Best regards,
- Mario Weilguni
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-From pgsql-hackers-owner+M29345@postgresql.org Tue Sep 24 03:02:50 2002
-Return-path: <pgsql-hackers-owner+M29345@postgresql.org>
-Received: from postgresql.org (postgresql.org [])
- by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8O72lg21051
- for <pgman@candle.pha.pa.us>; Tue, 24 Sep 2002 03:02:48 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP
- id 9B3EA4762F6; Tue, 24 Sep 2002 03:02:48 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [])
- by postgresql.org (Postfix) with SMTP
- id 902EA476020; Tue, 24 Sep 2002 03:02:45 -0400 (EDT)
-Received: from localhost (postgresql.org [])
- by postgresql.org (Postfix) with ESMTP id 98689475DAD
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 03:02:18 -0400 (EDT)
-Received: from www.pspl.co.in (www.pspl.co.in [])
- by postgresql.org (Postfix) with ESMTP id 47B8647592C
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 03:02:16 -0400 (EDT)
-Received: (from root@localhost)
- by www.pspl.co.in (8.11.6/8.11.6) id g8O73QQ16318
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 12:33:26 +0530
-Received: from daithan (daithan.intranet.pspl.co.in [])
- by www.pspl.co.in (8.11.6/8.11.0) with ESMTP id g8O73Q316313
- for <pgsql-hackers@postgresql.org>; Tue, 24 Sep 2002 12:33:26 +0530
-From: "Shridhar Daithankar" <shridhar_daithankar@persistent.co.in>
-To: pgsql-hackers@postgresql.org
-Date: Tue, 24 Sep 2002 12:32:43 +0530
-MIME-Version: 1.0
-Subject: Re: [HACKERS] Postgresql Automatic vacuum
-Reply-To: shridhar_daithankar@persistent.co.in
-Message-ID: <3D905B6B.1635.898382A@localhost>
-References: <3D9050B2.9782.86E55C0@localhost>
-In-Reply-To: <200209240842.06459.mweilguni@sime.com>
-X-Mailer: Pegasus Mail for Windows (v4.02)
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7BIT
-Content-Description: Mail message body
-X-Virus-Scanned: by AMaViS new-20020517
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-X-Virus-Scanned: by AMaViS new-20020517
-Status: OR
-On 24 Sep 2002 at 8:42, Mario Weilguni wrote:
-> On Tuesday, 24 September 2002 08:16, Shridhar Daithankar wrote:
-> IMO there are still several problems with that approach, namely:
-> * every database will get "polluted" with the autovacuum table, which is undesired
-I agree. But that was the best alternative I could see. An explanation
-follows. Besides, I didn't want to touch PG metadata.
-> * the biggest problem is the ~/.pgavrc file. I think it should work like other postgres utils do, e.g. supporting -U, -d, ....
-Shouldn't be a problem. The config stuff is working and I can add that. I would
-rather term it a minor issue. As a personal preference, I would just fire it up
-without any arguments. It's not a thing that you change daily. Configure it in
-the config file and you're done.
-> * it's not possible to use it without actively administering the config file. It should be able to work without
-> administrator assistance.
-Well, I would call that tuning. Each admin can tune it. Yes, it's an effort, but
-certainly not active administration.
-> When this is a daemon, why not store the data in memory? Even with several thousands of tables the memory footprint would
-> still be small. And it should be possible to use for all databases without modifying a config file.
-Well, when postgresql has the ability to deal with an arbitrary number of rows, it
-seemed redundant to me to duplicate all that functionality. Why write lists
-and arrays again and again? Let postgresql do it.
-> Two weeks ago I began writing a similar daemon, but had no time yet to finish it. I've tried to avoid using fixed numbers (namely "vacuum table
-> after 1000 updates") and tried to make my own heuristic based on the statistics data and the size of the table. The reason is, for a large table 1000 entries might be
-> a small percentage and vacuum is not necessary, while for small tables 10 updates might be sufficient.
-Well, that fixed number is not really fixed but admin-tunable, and per
-database at that. These are just defaults. Tune it to suit your needs.
-The objective of the whole exercise is to get rid of periodic vacuum, as this app
-shifts the threshold to activity rather than time.
-Besides, a table should be vacuumed when it starts affecting performance. On an
-installation, if changing 1K rows in a table of 1M rows affects performance, there
-will be a similar performance hit for a 1K-row update on a 100K-row table,
-because the overhead involved would be almost the same. (Not disk space. pgavd does not
-target vacuum full, but tuple size should matter.)
-At least I think so.
-I plan to implement per-table thresholds in addition to per-database thresholds.
-But right now, it seems like overhead to me. Besides, there is an item in TODO
-to shift the unit of work from rows to blocks affected. I guess that takes care of
-some of your points.
- Shridhar
-Jones' Second Law: The man who smiles when things go wrong has thought of
-someone to blame it on.
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
+++ /dev/null
-From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
-Received: from renoir.op.net (root@renoir.op.net [])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
- for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
-Received: from antares.mcs.anl.gov (mcs.anl.gov []) by renoir.op.net (o1/$ Revision: 1.18 $) with SMTP id FAA01775 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
-Received: from mcs.anl.gov (wit.mcs.anl.gov []) by antares.mcs.anl.gov (8.6.10/8.6.10) with ESMTP
- id EAA28698 for <maillist@candle.pha.pa.us>; Sat, 25 Jul 1998 04:27:05 -0500
-Sender: selkovjr@mcs.anl.gov
-Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
-Date: Sat, 25 Jul 1998 08:25:49 +0000
-From: "Gene Selkov, Jr." <selkovjr@mcs.anl.gov>
-Organization: MCS, Argonne Natl. Lab
-X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
-MIME-Version: 1.0
-To: Bruce Momjian <maillist@candle.pha.pa.us>
-Subject: position-aware scanners
-References: <199807250524.BAA07296@candle.pha.pa.us>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: RO
-I attached here (through the web links) a couple of examples, totally
-irrelevant to postgres but good enough to discuss token locations. I
-might as well try to patch the backend parser, though I am not sure how soon.
-1. The first c parser I wrote,
-http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
-sophisticated, so token locations reported by yyerr() may be slightly
-incorrect (+/- one position depending on the existence and type of the
-lookahead token). It is a filter used to typeset the units of measurement
-with eqn. To use it, unpack the tar file and run make. The Makefile is
-not too generic but I built it on various systems including linux,
-freebsd and sunos 4.3. The invocation can be something like this:
-./check 0 parse "l**3/(mmoll*min)"
-parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
- ^^^^^
-Now to the guts. As far as I can imagine, the only way to consistently
-keep track of each character read by the scanner (regardless of the
-length of expressions it will match) is to redefine its YY_INPUT like this:
-#undef YY_INPUT
-#define YY_INPUT(buf,result,max_size) \
-{ \
- int c = (int) buffer[pos++]; \
- result = (c == '\0') ? YY_NULL : (buf[0] = c, 1); \
-}
-Here, buffer is the pointer to the origin of the string being scanned
-and pos is a global variable, similar in usage to a file pointer (you
-can both read and manipulate it at will). The buffer and the pointer are
-initialized by the function
-void setString(char *s)
-{
- buffer = s;
- pos = 0;
-}
-each time the new string is to be parsed. This (exportable) function is
-part of the interface.
-In this simplistic design, yyerror() is part of the scanner module and
-it uses the pos variable to report the location of unexpected tokens.
-The downside of such an arrangement is that in case of an error condition, you
-can't easily tell whether your context is the current or the lookahead token; it
-just reports the position of the last token read (be it $ (end of
-buffer) or something else):
-./check 0 convert "mol/foo"
-parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
- ^^^
-(should be at the beginning of "foo")
-./check 0 convert "mmol//l"
-parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or
- ^
-(should be at the second '/')
-I believe this is why most simple parsers made with yacc would report
-parse errors being "at or near" some token, which is fair enough if the
-expression is not too complex.
-2. The second version of the same scanner,
-http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this
-problem by recording exact locations of the tokens in each instance of
-the token semantic data structure. The global,
-UNIT_YYSTYPE unit_yylval;
-would be normally used to export the token semantics (including its
-original or modified text and location data) to the parser.
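A small C sketch of the idea of recording exact locations per token, assuming a hand-written scanner instead of the flex-generated one in the tarball. `TokenLoc`, `set_string` and `next_token` are hypothetical names; only the `first_column`/`last_column` fields mirror the location fields used later in this message.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical location record filled in for every token. */
typedef struct {
    int first_column;   /* offset of the token's first character */
    int last_column;    /* offset of the token's last character */
} TokenLoc;

static const char *buffer;  /* origin of the string being scanned */
static int pos;             /* next unread offset in buffer */

/* Reset the scanner to the start of a new string. */
void set_string(const char *s)
{
    assert(s != NULL);
    buffer = s;
    pos = 0;
}

/*
 * Read the next token: a run of alphanumerics, or any other single
 * character. Returns 1 and fills *loc, or returns 0 at end of input.
 */
int next_token(TokenLoc *loc)
{
    while (buffer[pos] == ' ')
        pos++;
    if (buffer[pos] == '\0')
        return 0;
    loc->first_column = pos;
    if (isalnum((unsigned char) buffer[pos])) {
        while (isalnum((unsigned char) buffer[pos]))
            pos++;
    } else {
        pos++;
    }
    loc->last_column = pos - 1;
    return 1;
}
```

Because each token carries its own start and end, an error on "foo" in "mol/foo" can point at columns 4 to 6 regardless of how far the lookahead has advanced.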
-Unfortunately, I cannot show you the parser part in c, because that's
-about when I stopped writing parsers in c. Instead, I included a small
-test program, test.c, that mimics the parser's expectations for the
-scanner data pretty well. I am assuming here that you are not interested
-in digging through someone else's ugly guts for a relatively small bit of
-information; let me know if I am wrong and I will send you the complete
-perl code (also generated with bison).
-To run this example, unpack the tar file and run make. Then do
- gcc test.c scanner.o
-and run a.out
-Note the line
- yylval = unit_getyylval();
-in test.c. You will not normally need it in a c parser. It is enough to
-define yylval as an external variable and link it to yylval in yylex().
-In the bison-generated parser, yylval gets pushed into a stack (pointed
-to by yylsp) each time a new token is read. For each syntax rule, the
-bison macros @1, @2, ... are just shortcuts to locations in the stack 1,
-2, ... levels deep. In the following code fragment, @3 refers to the
-location info for the third term in the rule (INTEGER):
-(sorry about perl, but I think you can do the same things in c without
-significant changes to your existing parser)
-term: base {
-        $$ = $1;
-        $$->{'order'} = 1;
-      }
-    | base EXP INTEGER {
-        $$ = $1;
-        $$->{'order'} = @3->{'text'};
-        $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
-        if ( $$->{'order'} == 0 ) {
-          yyerror("Error: expecting a non-zero integer exponent");
-        }
-      }
-which translates to:
- ($yyn == 10) && do {
-   $yyval = $yyvsa[-1];
-   $yyval->{'order'} = 1;
-   last SWITCH;
- };
- ($yyn == 11) && do {
-   $yyval = $yyvsa[-3];
-   $yyval->{'order'} = $yylsa[-1]->{'text'};
-   $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
-   if ( $yyval->{'order'} == 0 ) {
-     yyerror("Error: expecting a non-zero integer exponent");
-     goto yyerrlab1 ;
-   }
-   last SWITCH;
- };
-In c, you will have a bit more complicated pointer arithmetic to address
-the stack, but the usage of objects will be the same. Note here that it
-is convenient to keep all information about the token in its location
-info (yylsa, yylsp, yylval, @n), while everything relating to the value
-of the expression, or to the parse tree, is better placed in the
-semantic stack (yyvsa, yyvsp, yyval, $n). Also note that in some cases
-you can do semantic checks inside rules and report useful messages
-before or instead of invoking yyerror().
-Finally, it is useful to make the following wrapper function around
-external yylex() in order to maintain your own token stack. Unlike the
-parser's internal stack which is only as deep as the rule being reduced,
-this one can hold all tokens recognized during the current run, and that
-can be extremely helpful for error reporting and any transformations you
-may need. In this way, you can even scan (tokenize) the whole buffer
-before handing it off to the parser (who knows, you may need a token
-ahead of what is currently seen by the parser):
-sub tokenize {
-  undef @tokenTable;
-  my ($tok, $text, $name, $unit, $first_line, $first_column,
-      $last_line, $last_column);
-  # this is where the c-coded yylex is called;
-  # UnitLex is the perl extension encapsulating it
-  while ( ($tok = &UnitLex::yylex()) > 0 ) {
-    ( $text, $name, $unit, $first_line, $first_column, $last_line,
-      $last_column ) = &UnitLex::getyylval;
-    push(@tokenTable,
-         Unit::yyltype->new (
-           'token' => $tok,
-           'text' => $text,
-           'name' => $name,
-           'unit' => $unit,
-           'first_line' => $first_line,
-           'first_column' => $first_column,
-           'last_line' => $last_line,
-           'last_column' => $last_column,
-         )
-    );
-  }
-}
-It is now a lot easier to handle various state-related problems, such as
-backtracking and error reporting. The yylex() function as seen by the
-parser might be constructed somewhat like this:
-sub yylex {
- $yylloc = $tokenTable[$tokenNo]; # $tokenNo is a global; now instead of a "file pointer",
-                                  # as in the first example, we have a "token pointer"
- undef $yylval;
- # disregard this; name this block "computing semantic values"
- if ( $yylloc->{'token'} == UNIT) {
- $yylval = Unit::Operand->new(
- 'unit' => Unit::Dict::unit($yylloc->{'unit'}),
- 'base' => Unit::Dict::base($yylloc->{'unit'}),
- 'scale' => Unit::Dict::scale($yylloc->{'unit'}),
- 'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}),
- 'loc' => $yylloc,
- );
- }
- elsif ( ($yylloc->{'token'} == INTEGER ) || ($yylloc->{'token'} ==
- $yylval = Unit::Operand->new(
- 'unit' => '1',
- 'base' => '1',
- 'scale' => 1,
- 'scaleToBase' => 1,
- 'loc' => $yylloc,
- );
- }
- $tokenNo++;
- return($yylloc->{'token'}); # This is all the parser needs to know about this token.
-                             # But we already made sure we saved everything we need to know.
-}
-Now the most interesting part, the error reporting routine:
-sub yyerror {
-  my ($str) = @_;
-  my ($message, $start, $end, $loc);
-  $loc = $tokenTable[$tokenNo-1]; # This is the same as to say,
-                                  # "obtain the location info for the current token"
-  # You may use this routine for your own purposes or let parser use it
-  if( $str ne 'parse error' ) {
-    $message = "$str instead of `" . $loc->{'name'} . "' <" .
-      $loc->{'text'} . ">, at line " . $loc->{'first_line'} . ":\n\n";
-  }
-  else {
-    $message = "unexpected token `" . $loc->{'name'} . "' <" .
-      $loc->{'text'} . ">, at line " . $loc->{'first_line'} . ":\n\n";
-  }
-  # $parseBuffer is the original string that was used to set the parser buffer
-  $message .= $parseBuffer . "\n";
-  $message .= ( ' ' x ($loc->{'first_column'} + 1) ) .
-              ( '^' x length($loc->{'text'}) ) . "\n";
-  if( $str ne 'parse error' ) {
-    print STDERR "$str instead of `", $loc->{'name'}, "' {",
-      $loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n";
-  }
-  else {
-    print STDERR "unexpected token `", $loc->{'name'}, "' {",
-      $loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n";
-  }
-  print STDERR "$parseBuffer\n";
-  print STDERR ' ' x ($loc->{'first_column'} + 1),
-               '^' x length($loc->{'text'}), "\n";
-}
-Scanners used in these examples assume there is a single line of text on
-the input (the first_line and last_line elements of yylloc are simply
-ignored). If you want to be able to parse multi-line buffers, just add a
-lex rule for '\n' that will increment the line count and reset the pos
-variable to zero.
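The multi-line bookkeeping described above can be sketched in plain C, outside of lex. The function name `offset_to_line_col` is hypothetical; it derives a 1-based line and a 0-based column from a character offset by doing what the suggested '\n' rule would do, namely bumping the line count and resetting the column at each newline.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: walk the buffer up to `offset`, counting lines.
 * Each newline increments *line and resets *column to zero; every
 * other character advances *column by one.
 */
void offset_to_line_col(const char *text, size_t offset,
                        int *line, int *column)
{
    assert(text != NULL && line != NULL && column != NULL);
    *line = 1;
    *column = 0;
    for (size_t i = 0; i < offset && text[i] != '\0'; i++) {
        if (text[i] == '\n') {
            (*line)++;
            *column = 0;
        } else {
            (*column)++;
        }
    }
}
```

A scanner that keeps these two counters up to date as it reads gets the same result without rescanning the buffer; the rescan here just keeps the sketch self-contained.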
-Ugly as it may seem, I find this approach extremely liberating. If the
-grammar becomes too complicated for a LALR(1) parser, I can cascade
-multiple parsers. The token table can then be used to reassemble parts
-of original expression for subordinate parsers, preserving the location
-info all the way down, so that subordinate parsers can report their
-problems consistently. You probably don't need this, as SQL is very well
-thought out and has a parsable grammar. But it may be of some help for
-error reporting.
-From pgsql-patches-owner+M1499@postgresql.org Sat Aug 4 13:11:53 2001
-Return-path: <pgsql-patches-owner+M1499@postgresql.org>
-Received: from postgresql.org (webmail.postgresql.org [])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f74HBrh11339
- for <pgman@candle.pha.pa.us>; Sat, 4 Aug 2001 13:11:53 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [])
- by postgresql.org (8.11.3/8.11.4) with SMTP id f74H89655183;
- Sat, 4 Aug 2001 13:08:09 -0400 (EDT)
- (envelope-from pgsql-patches-owner+M1499@postgresql.org)
-Received: from sss.pgh.pa.us ([])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id f74Gxb653074
- for <pgsql-patches@postgresql.org>; Sat, 4 Aug 2001 12:59:37 -0400 (EDT)
- (envelope-from tgl@sss.pgh.pa.us)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [])
- by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f74GtPC29183;
- Sat, 4 Aug 2001 12:55:25 -0400 (EDT)
-To: Dave Page <dpage@vale-housing.co.uk>
-cc: "'Fernando Nasser'" <fnasser@cygnus.com>,
- Bruce Momjian <pgman@candle.pha.pa.us>, Neil Padgett <npadgett@redhat.com>,
- pgsql-patches@postgresql.org
-Subject: Re: [PATCHES] Patch for Improved Syntax Error Reporting
-In-Reply-To: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk>
-References: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk>
-Comments: In-reply-to Dave Page <dpage@vale-housing.co.uk>
- message dated "Sat, 04 Aug 2001 12:37:23 +0100"
-Date: Sat, 04 Aug 2001 12:55:24 -0400
-Message-ID: <29180.996944124@sss.pgh.pa.us>
-From: Tom Lane <tgl@sss.pgh.pa.us>
-Precedence: bulk
-Sender: pgsql-patches-owner@postgresql.org
-Status: OR
-Dave Page <dpage@vale-housing.co.uk> writes:
-> Oh, I quite agree. I'm not averse to updating my code, I just want to avoid
-> users getting misleading messages until I come up with those updates.
-Hmm ... if they were actively misleading then I'd share your concern.
-I guess what you're thinking is that the error offset reported by the
-backend won't correspond directly to what the user typed, and if the
-user tries to use the offset to manually count off characters, he may
-arrive at the wrong place? Good point. I'm not sure whether a message
- ERROR: parser: parse error at or near 'frum';
-would be likely to encourage people to try that. Thoughts? (I do think
-this is a good argument for not embedding the position straight into the
-main error message though...)
-One possible compromise is to combine the straight character-offset
-approach with a simplistic context display:
- ERROR: parser: parse error at or near 'frum';
- POSITION: 42 ... oid,relname FRUM ...
-The idea is to define the "POSITION" field as an integer offset possibly
-followed by whitespace and noise words. An updated client would grab
-the offset, ignore the rest of the field, and do the right thing. A
-not-updated client would display the entire message, and with any luck
-the user would read it correctly.
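A sketch of how an updated client might read such a field, assuming it receives the text after "POSITION:" as a C string. The function name `parse_position_field` and the -1 error convention are hypothetical; `strtol` is standard C and stops at the first character that is not part of the number, which is what lets the trailing context be ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical client-side helper: return the leading integer offset
 * from a POSITION field, ignoring any whitespace and noise words that
 * follow it. Returns -1 if the field does not start with a usable,
 * non-negative integer.
 */
long parse_position_field(const char *field)
{
    assert(field != NULL);
    char *end;
    errno = 0;
    long offset = strtol(field, &end, 10);
    if (end == field || errno == ERANGE || offset < 0)
        return -1;
    return offset;
}
```

A client that has not been updated would simply print the whole field, context included, which is the fallback behavior the proposal relies on.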
- regards, tom lane
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?