From: Bruce Momjian Date: Thu, 12 Feb 2004 18:11:54 +0000 (+0000) Subject: Remove TODO.detail files that contained useless or very old information. X-Git-Tag: REL8_0_0BETA1~1186 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=2b721d3d413598bd0bd09d5045fbcb66f5982353;p=postgresql Remove TODO.detail files that contained useless or very old information. Update TODO accordingly. --- diff --git a/doc/TODO.detail/foreign b/doc/TODO.detail/foreign deleted file mode 100644 index f05cf35f1c..0000000000 --- a/doc/TODO.detail/foreign +++ /dev/null @@ -1,542 +0,0 @@ -From fjoe@iclub.nsu.ru Tue Jan 23 03:38:45 2001 -Received: from mx.nsu.ru (root@mx.nsu.ru [193.124.215.71]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA14458 - for ; Tue, 23 Jan 2001 03:38:24 -0500 (EST) -Received: from iclub.nsu.ru (root@iclub.nsu.ru [193.124.222.66]) - by mx.nsu.ru (8.9.1/8.9.0) with ESMTP id OAA29153; - Tue, 23 Jan 2001 14:31:27 +0600 (NOVT) -Received: from localhost (fjoe@localhost) - by iclub.nsu.ru (8.11.1/8.11.1) with ESMTP id f0N8VOr15273; - Tue, 23 Jan 2001 14:31:25 +0600 (NS) - (envelope-from fjoe@iclub.nsu.ru) -Date: Tue, 23 Jan 2001 14:31:24 +0600 (NS) -From: Max Khon -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: [HACKERS] Bug in FOREIGN KEY -In-Reply-To: <200101230416.XAA04293@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO - -hi, there! - -On Mon, 22 Jan 2001, Bruce Momjian wrote: - -> -> > This problem with foreign keys has been reported to me, and I have confirmed -> > the bug exists in current sources. 
The DELETE should succeed: -> > -> > --------------------------------------------------------------------------- -> > -> > CREATE TABLE primarytest2 ( -> > col1 INTEGER, -> > col2 INTEGER, -> > PRIMARY KEY(col1, col2) -> > ); -> > -> > CREATE TABLE foreigntest2 (col3 INTEGER, -> > col4 INTEGER, -> > FOREIGN KEY (col3, col4) REFERENCES primarytest2 -> > ); -> > test=> BEGIN; -> > BEGIN -> > test=> INSERT INTO primarytest2 VALUES (5,5); -> > INSERT 27618 1 -> > test=> DELETE FROM primarytest2 WHERE col1 = 5 AND col2 = 5; -> > ERROR: triggered data change violation on relation "primarytest2" - -I have another (slightly different) example: ---- cut here --- -test=> CREATE TABLE pr(obj_id int PRIMARY KEY); -NOTICE: CREATE TABLE/PRIMARY KEY will create implicit index 'pr_pkey' for -table 'pr' -CREATE -test=> CREATE TABLE fr(obj_id int REFERENCES pr ON DELETE CASCADE); -NOTICE: CREATE TABLE will create implicit trigger(s) for FOREIGN KEY -check(s) -CREATE -test=> BEGIN; -BEGIN -test=> INSERT INTO pr (obj_id) VALUES (1); -INSERT 200539 1 -test=> INSERT INTO fr (obj_id) SELECT obj_id FROM pr; -INSERT 200540 1 -test=> DELETE FROM fr; -ERROR: triggered data change violation on relation "fr" -test=> ---- cut here --- - -we are running postgresql 7.1 beta3 - -/fjoe - - -From sszabo@megazone23.bigpanda.com Tue Jan 23 13:41:55 2001 -Received: from megazone23.bigpanda.com (rfx-64-6-210-138.users.reflexcom.com [64.6.210.138]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA19924 - for ; Tue, 23 Jan 2001 13:41:54 -0500 (EST) -Received: from localhost (sszabo@localhost) - by megazone23.bigpanda.com (8.11.1/8.11.1) with ESMTP id f0NIfLa41018; - Tue, 23 Jan 2001 10:41:21 -0800 (PST) -Date: Tue, 23 Jan 2001 10:41:21 -0800 (PST) -From: Stephan Szabo -To: Bruce Momjian -cc: Jan Wieck , Peter Eisentraut , - PostgreSQL-development -Subject: Re: [HACKERS] Bug in FOREIGN KEY -In-Reply-To: <200101230417.XAA04332@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; 
charset=US-ASCII -Status: RO - - -> > Think I misinterpreted the SQL3 specs WR to this detail. The -> > checks must be made per statement, not at the transaction -> > level. I'll try to fix it, but we need to define what will -> > happen with referential actions in the case of conflicting -> > actions on the same key - there are some possible conflicts: -> > -> > 1. DEFERRED ON DELETE NO ACTION or RESTRICT -> > -> > Do the referencing rows reference to the new PK row with -> > the same key now, or is this still a constraint -> > violation? I would say it's not, because the constraint -> > condition is satisfied at the end of the transaction. How -> > do other databases behave? -> > -> > 2. DEFERRED ON DELETE CASCADE, SET NULL or SET DEFAULT -> > -> > Again I'd say that the action should be suppressed -> > because a matching PK row is present at transaction end - -> > it's not the same old row, but the constraint itself is -> > still satisfied. - -I'm not actually sure on the cascade, set null and set default. The -way they are written seems to imply to me that it's based on the state -of the database before/after the command in question as opposed to the -deferred state of the database because of the stuff about updating the -state of partially matching rows immediately after the delete/update of -the row which wouldn't really make sense when deferred. 
Does anyone know -what other systems do with a case something like this all in a -transaction: - -create table a (a int primary key); -create table b (b int references a match full on update cascade - on delete cascade deferrable initially deferred); -insert into a values (1); -insert into a values (2); -insert into b values (1); -delete from a where a=1; -select * from b; -commit; - - -From pgsql-hackers-owner+M3901@postgresql.org Fri Jan 26 17:00:24 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA10576 - for ; Fri, 26 Jan 2001 17:00:24 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0QLtVq53019; - Fri, 26 Jan 2001 16:55:31 -0500 (EST) - (envelope-from pgsql-hackers-owner+M3901@postgresql.org) -Received: from smtp1b.mail.yahoo.com (smtp3.mail.yahoo.com [128.11.68.135]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0QLqmq52691 - for ; Fri, 26 Jan 2001 16:52:48 -0500 (EST) - (envelope-from janwieck@yahoo.com) -Received: from j13.us.greatbridge.com (HELO jupiter.greatbridge.com) (216.54.52.153) - by smtp.mail.vip.suc.yahoo.com with SMTP; 26 Jan 2001 22:49:57 -0000 -X-Apparently-From: -Received: (from janwieck@localhost) - by jupiter.greatbridge.com (8.9.3/8.9.3) id RAA04701; - Fri, 26 Jan 2001 17:02:32 -0500 -From: Jan Wieck -Message-Id: <200101262202.RAA04701@jupiter.greatbridge.com> -Subject: Re: [HACKERS] Bug in FOREIGN KEY -In-Reply-To: <200101262110.QAA06902@candle.pha.pa.us> from Bruce Momjian at "Jan - 26, 2001 04:10:22 pm" -To: Bruce Momjian -Date: Fri, 26 Jan 2001 17:02:32 -0500 (EST) -CC: Jan Wieck , Peter Eisentraut , - PostgreSQL-development -X-Mailer: ELM [version 2.4ME+ PL68 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: RO - -Bruce Momjian 
wrote: -> Here is another bug: -> -> test=> begin; -> BEGIN -> test=> INSERT INTO primarytest2 VALUES (5,5); -> INSERT 18757 1 -> test=> UPDATE primarytest2 SET col2=1 WHERE col1 = 5 AND col2 = 5; -> ERROR: deferredTriggerGetPreviousEvent: event for tuple (0,10) not -> found - - Schema? - - -Jan - --- - -#======================================================================# -# It's easier to get forgiveness for being wrong than for being right. # -# Let's break this rule - forgive me. # -#================================================== JanWieck@Yahoo.com # - - - -_________________________________________________________ -Do You Yahoo!? -Get your free @yahoo.com address at http://mail.yahoo.com - - -From pgsql-hackers-owner+M3864@postgresql.org Fri Jan 26 10:07:36 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id KAA17732 - for ; Fri, 26 Jan 2001 10:07:35 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0QF3lq12782; - Fri, 26 Jan 2001 10:03:47 -0500 (EST) - (envelope-from pgsql-hackers-owner+M3864@postgresql.org) -Received: from mailout00.sul.t-online.com (mailout00.sul.t-online.com [194.25.134.16]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0QF0Yq12614 - for ; Fri, 26 Jan 2001 10:00:34 -0500 (EST) - (envelope-from peter_e@gmx.net) -Received: from fwd01.sul.t-online.com - by mailout00.sul.t-online.com with smtp - id 14MALp-0006Im-00; Fri, 26 Jan 2001 15:59:45 +0100 -Received: from peter.localdomain (520083510237-0001@[212.185.245.73]) by fmrl01.sul.t-online.com - with esmtp id 14MALQ-1Z0gkaC; Fri, 26 Jan 2001 15:59:20 +0100 -Date: Fri, 26 Jan 2001 16:07:27 +0100 (CET) -From: Peter Eisentraut -To: Hiroshi Inoue -cc: Bruce Momjian , - PostgreSQL-development -Subject: Re: [HACKERS] Open 7.1 items -In-Reply-To: <3A70FA87.933B3D51@tpf.co.jp> -Message-ID: -MIME-Version: 1.0 
-Content-Type: TEXT/PLAIN; charset=US-ASCII -X-Sender: 520083510237-0001@t-dialin.net -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: RO - -Hiroshi Inoue writes: - -> What does this item mean ? -> Is it the following ? -> -> begin; -> insert into pk (id) values (1); -> update(delete from) pk where id=1; -> ERROR: triggered data change violation on relation pk" -> -> If so, isn't it a simple bug ? - -Depends on the definition of "bug". It's not spec compliant and it's not -documented and it's annoying. But it's been like this for a year and the -issue is well known and can normally be avoided. It looks like a -documentation to-do to me. - --- -Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/ - - -From pgsql-hackers-owner+M3876@postgresql.org Fri Jan 26 13:07:10 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA26086 - for ; Fri, 26 Jan 2001 13:07:09 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0QI4Vq30248; - Fri, 26 Jan 2001 13:04:31 -0500 (EST) - (envelope-from pgsql-hackers-owner+M3876@postgresql.org) -Received: from sectorbase2.sectorbase.com ([208.48.122.131]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0QI3Aq30098 - for ; Fri, 26 Jan 2001 13:03:11 -0500 (EST) - (envelope-from vmikheev@SECTORBASE.COM) -Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19) - id ; Fri, 26 Jan 2001 09:41:23 -0800 -Message-ID: <8F4C99C66D04D4118F580090272A7A234D32C1@sectorbase1.sectorbase.com> -From: "Mikheev, Vadim" -To: "'Jan Wieck'" , - PostgreSQL HACKERS - , - Bruce Momjian -Subject: RE: [HACKERS] Open 7.1 items -Date: Fri, 26 Jan 2001 10:02:59 -0800 -MIME-Version: 1.0 -X-Mailer: Internet Mail Service (5.5.2653.19) -Content-Type: text/plain; - charset="iso-8859-1" -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org 
-Status: RO - -> > FOREIGN KEY INSERT & UPDATE/DELETE in transaction "change violation" -> -> A well known issue, and I've asked multiple times how exactly -> we want to define the behaviour for deferred constraints. Do -> foreign keys reference just to a key value and are happy with -> it's existance, or do they refer to a particular row? - -I think first. The last is closer to OODBMS world, not to [O]RDBMS one. - -> Consider you have a deferred "ON DELETE CASCADE" constraint -> and do a DELETE, INSERT of a PK. Do the FK rows need to be -> deleted or not? - -Good example. I think FK should not be deleted. If someone really -want to delete "old" FK then he can do - -DELETE PK; -SET CONSTRAINT ... IMMEDIATE; -- FK need to be deleted here -INSERT PK; - -> Consider you have a deferred "ON DELETE RESTRICT" and "ON -> UPDATE CASCADE" constraint. If you DELETE PK1 and UPDATE PK2 -> to PK1, the FK2 rows need to follow, but does PK2 inherit all -> FK1 rows now so it's the master of both groups? - -Yes. Again one can use SET CONSTRAINT to achieve desirable results. -It seems that SET CONSTRAINT was designed for these purposes - ie -for better flexibility. 
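[Editor's note: Vadim's point — that with a DEFERRED constraint the check moves from statement time to COMMIT, so DELETE-then-INSERT of the same key is legal — can be demonstrated outside PostgreSQL. A minimal sketch using Python's sqlite3 (SQLite also implements DEFERRABLE INITIALLY DEFERRED); this is an illustration of the deferred-check timing under discussion, not the PostgreSQL trigger-queue code itself.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE pk (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE fk (
    ref INTEGER REFERENCES pk(id) DEFERRABLE INITIALLY DEFERRED)""")
conn.execute("INSERT INTO pk VALUES (1)")
conn.execute("INSERT INTO fk VALUES (1)")
conn.commit()

# One transaction: delete the PK row, then re-insert the same key.
# Because the constraint is DEFERRED, the check runs at COMMIT, where
# the key exists again -- so no error, and the FK row survives.
conn.execute("DELETE FROM pk WHERE id = 1")
conn.execute("INSERT INTO pk VALUES (1)")
conn.commit()

print(conn.execute("SELECT ref FROM fk").fetchall())  # [(1,)]
```

With an IMMEDIATE constraint the DELETE itself would fail, which is exactly the behavior Vadim's SET CONSTRAINT ... IMMEDIATE workaround exploits to force the cascade mid-transaction.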
- -Though, it would be better to look how other DBes handle all these -cases -:) - -Vadim - -From janwieck@yahoo.com Fri Jan 26 12:20:27 2001 -Received: from smtp6.mail.yahoo.com (smtp6.mail.yahoo.com [128.11.69.103]) - by candle.pha.pa.us (8.9.0/8.9.0) with SMTP id MAA22158 - for ; Fri, 26 Jan 2001 12:20:27 -0500 (EST) -Received: from j13.us.greatbridge.com (HELO jupiter.greatbridge.com) (216.54.52.153) - by smtp.mail.vip.suc.yahoo.com with SMTP; 26 Jan 2001 17:20:26 -0000 -X-Apparently-From: -Received: (from janwieck@localhost) - by jupiter.greatbridge.com (8.9.3/8.9.3) id MAA03196; - Fri, 26 Jan 2001 12:30:05 -0500 -From: Jan Wieck -Message-Id: <200101261730.MAA03196@jupiter.greatbridge.com> -Subject: Re: [HACKERS] Open 7.1 items -To: PostgreSQL HACKERS , - Bruce Momjian -Date: Fri, 26 Jan 2001 12:30:05 -0500 (EST) -X-Mailer: ELM [version 2.4ME+ PL68 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Status: RO - -Bruce Momjian wrote: -> Here are my open 7.1 items. Thanks for shrinking the list so far. -> -> --------------------------------------------------------------------------- -> -> FreeBSD locale bug -> Reorder INSERT firing in rules - - I don't recall why this is wanted. AFAIK there's no reason - NOT to do so, except for the actual state of beeing far too - close to a release candidate. - -> Philip Warner UPDATE crash -> JDBC LargeObject short read return value missing -> SELECT cash_out(1) crashes all backends -> LAZY VACUUM -> FOREIGN KEY INSERT & UPDATE/DELETE in transaction "change violation" - - A well known issue, and I've asked multiple times how exactly - we want to define the behaviour for deferred constraints. Do - foreign keys reference just to a key value and are happy with - it's existance, or do they refer to a particular row? - - Consider you have a deferred "ON DELETE CASCADE" constraint - and do a DELETE, INSERT of a PK. Do the FK rows need to be - deleted or not? 
- - Consider you have a deferred "ON DELETE RESTRICT" and "ON - UPDATE CASCADE" constraint. If you DELETE PK1 and UPDATE PK2 - to PK1, the FK2 rows need to follow, but does PK2 inherit all - FK1 rows now so it's the master of both groups? - - These are only two possible combinations. There are many to - think of. As said, I've asked before, but noone voted yet. - Move the item to 7.2 anyway, because changing this behaviour - would require massive changes in the trigger queue *and* the - generic RI triggers, which cannot be tested enough any more. - - -Jan - -> Usernames limited in length -> Does pg_dump preserve COMMENTs? -> Failure of nested cursors in JDBC -> JDBC setMaxRows() is global variable affecting other objects -> Does JDBC Makefile need current dir? -> Fix for pg_dump of bad system tables -> Steve Howe failure query with rules -> ODBC/JDBC not disconnecting properly? -> Magnus Hagander ODBC issues? -> Merge MySQL/PgSQL translation scripts -> Fix ipcclean on Linux -> Merge global and template BKI files? -> -> -> -- -> Bruce Momjian | http://candle.pha.pa.us -> pgman@candle.pha.pa.us | (610) 853-3000 -> + If your life is a hard drive, | 830 Blythe Avenue -> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 -> - - --- - -#======================================================================# -# It's easier to get forgiveness for being wrong than for being right. # -# Let's break this rule - forgive me. # -#================================================== JanWieck@Yahoo.com # - - -_________________________________________________________ -Do You Yahoo!? 
-Get your free @yahoo.com address at http://mail.yahoo.com - - -From pgsql-general-owner+M590@postgresql.org Tue Nov 14 16:30:40 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA22313 - for ; Tue, 14 Nov 2000 17:30:39 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eAEMSJs66979; - Tue, 14 Nov 2000 17:28:21 -0500 (EST) - (envelope-from pgsql-general-owner+M590@postgresql.org) -Received: from megazone23.bigpanda.com (138.210.6.64.reflexcom.com [64.6.210.138]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eAEMREs66800 - for ; Tue, 14 Nov 2000 17:27:14 -0500 (EST) - (envelope-from sszabo@megazone23.bigpanda.com) -Received: from localhost (sszabo@localhost) - by megazone23.bigpanda.com (8.11.1/8.11.0) with ESMTP id eAEMPpH69059; - Tue, 14 Nov 2000 14:25:51 -0800 (PST) -Date: Tue, 14 Nov 2000 14:25:51 -0800 (PST) -From: Stephan Szabo -To: "Beth K. Gatewood" -cc: pgsql-general@postgresql.org -Subject: Re: [GENERAL] a request for some experienced input..... -In-Reply-To: <3A11ACA1.E5D847DD@mbt.washington.edu> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - - -On Tue, 14 Nov 2000, Beth K. Gatewood wrote: - -> > -> -> Stephan- -> -> Thank you so much for taking the effort to answer this these questions. You -> help is truly appreciated.... -> -> I just have a few points for clarification. -> -> > -> > MATCH PARTIAL is a specific match type which describes which rows are -> > considered matching rows for purposes of meeting or failing the -> > constraint. (In match partial, a fktable (NULL, 2) would match a pk -> > table (1,2) as well as a pk table (2,2). 
It's different from match -> > full in which case (NULL,2) would be invalid or match unspecified -> > in which case it would match due to the existance of the NULL in any -> > case). There are some bizarre implementation details involved with -> > it and it's different from the others in ways that make it difficult. -> > It's in my list of things to do, but I haven't come up with an acceptable -> > mechanism in my head yet. -> -> Does this mean, currently that I can not have foreign keys with null values? - -Not exactly... - -Match full = In FK row, all columns must be NULL or the value of each - column must not be null and there is a row in the PK table where - each referencing column equals the corresponding referenced - column. - -Unspecified = In FK row, at least one column must be NULL or each - referencing column shall be equal to the corresponding referenced - column in some row of the referenced table - -Match partial is similar to match full except we ignore the null columns - for purposes of the each referencing column equals bit. - -For example: - PK Table Key values: (1,2), (1,3), (3,3) - Attempted FK Table Key values: (1,2), (1,NULL), (5,NULL), (NULL, NULL) - (hopefully I get this right)... - In match full, only the 1st and 4th fk values are valid. - In match partial, the 1st, 2nd, and 4th fk values are valid. - In match unspecified, all the fk values are valid. - -The other note is that generally speaking, all three are basically the -same for the single column key. If you're only doing references on one -column, the match type is mostly meaningless. - -> > PENDANT adds that for each row of the referenced table the values of -> > the specified column(s) are the same as the values of the specified -> > column(s) in some row of the referencing tables. -> -> I am not sure I know what you mean here.....Are you saying that the value for -> the FK column must match the value for the PK column? 
- -I haven't really looked at PENDANT, the above was just a small rewrite of -some descriptive text in the sql99 draft I have. There's a whole bunch -of rules in the actual text of the referential constraint definition. - -The base stuff seems to be: (Rf is the referencing columns, T is the -referenced table) - - 3) If PENDANT is specified, then: - a) For a given row in the referencing table, let pendant - reference designate an instance in which all Rf are - non-null. - - b) Let number of pendant paths be the number of pendant - references to the same referenced row in a referenced table - from all referencing rows in all base tables. - - c) For every row in T, the number of pendant paths is equal to - or greater than 1. - -So, I'd read it as every row in T must have at least one referencing row -in some base table. - -There are some details about updates and that you can't mix PENDANT and -MATCH PARTIAL or SET DEFAULT actions. - -> > The main issues in 7.0 are that older versions (might be fixed in -> > 7.0.3) would fail very badly if you used alter table to rename tables that -> > were referenced in a fk constraint and that you need to give update -> > permission to the referenced table. For the former, 7.1 will (and 7.0.3 -> > may) give an elog(ERROR) to you rather than crashing the backend and the -> > latter should be fixed for 7.1 (although you still need to have write -> > perms to the referencing table for referential actions to work properly) -> -> Are the steps to this outlined somewhere then? - -The permissions stuff is just a matter of using GRANT and REVOKE to set -the permissions that a user has to a table. 
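[Editor's note: Stephan's walk-through of the three match types earlier in this thread — PK keys (1,2), (1,3), (3,3) against candidate FK keys (1,2), (1,NULL), (5,NULL), (NULL,NULL) — can be restated as a small sketch. This is just the SQL rules rewritten in Python for clarity, not PostgreSQL source.]

```python
# PK table key values from Stephan's example.
PK = {(1, 2), (1, 3), (3, 3)}

def match_full(fk):
    # All columns NULL, or no column NULL and the whole key matches a PK row.
    if all(v is None for v in fk):
        return True
    return all(v is not None for v in fk) and fk in PK

def match_unspecified(fk):
    # Any NULL column makes the row valid; otherwise the key must match.
    if any(v is None for v in fk):
        return True
    return fk in PK

def match_partial(fk):
    # Like MATCH FULL, but NULL columns are ignored when comparing.
    return any(all(f is None or f == p for f, p in zip(fk, row))
               for row in PK) or all(v is None for v in fk)

candidates = [(1, 2), (1, None), (5, None), (None, None)]
print([fk for fk in candidates if match_full(fk)])         # 1st and 4th
print([fk for fk in candidates if match_partial(fk)])      # 1st, 2nd and 4th
print([fk for fk in candidates if match_unspecified(fk)])  # all four
```

For a single-column key the three functions agree, which is Stephan's closing observation that the match type is mostly meaningless there.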
- - diff --git a/doc/TODO.detail/fsync b/doc/TODO.detail/fsync deleted file mode 100644 index 6163dc4319..0000000000 --- a/doc/TODO.detail/fsync +++ /dev/null @@ -1,129 +0,0 @@ -From pgsql-hackers-owner+M908@postgresql.org Sun Nov 19 14:27:43 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA10885 - for ; Sun, 19 Nov 2000 14:27:42 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eAJJSMs83653; - Sun, 19 Nov 2000 14:28:22 -0500 (EST) - (envelope-from pgsql-hackers-owner+M908@postgresql.org) -Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46] (may be forged)) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eAJJQns83565 - for ; Sun, 19 Nov 2000 14:26:49 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id OAA06790; - Sun, 19 Nov 2000 14:23:06 -0500 (EST) -From: Bruce Momjian -Message-Id: <200011191923.OAA06790@candle.pha.pa.us> -Subject: Re: [HACKERS] WAL fsync scheduling -In-Reply-To: <002101c0525e$2d964480$b97a30d0@sectorbase.com> "from Vadim Mikheev - at Nov 19, 2000 11:23:19 am" -To: Vadim Mikheev -Date: Sun, 19 Nov 2000 14:23:06 -0500 (EST) -CC: Tom Samplonius , Alfred@candle.pha.pa.us, - Perlstein , Larry@candle.pha.pa.us, - Rosenman , - PostgreSQL-development -X-Mailer: ELM [version 2.4ME+ PL77 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -[ Charset ISO-8859-1 unsupported, converting... ] -> > There are two parts to transaction commit. The first is writing all -> > dirty buffers or log changes to the kernel, and second is fsync of the -> ^^^^^^^^^^^^ -> Backend doesn't write any dirty buffer to the kernel at commit time. - -Yes, I suspected that. 
- -> -> > log file. -> -> The first part is writing commit record into WAL buffers in shmem. -> This is what XLogInsert does. After that XLogFlush is called to ensure -> that entire commit record is on disk. XLogFlush does *both* write() and -> fsync() (single slock is used for both writing and fsyncing) if it needs to -> do it at all. - -Yes, I realize there are new steps in WAL. - -> -> > I suggest having a per-backend shared memory byte that has the following -> > values: -> > -> > START_LOG_WRITE -> > WAIT_ON_FSYNC -> > NOT_IN_COMMIT -> > backend_number_doing_fsync -> > -> > I suggest that when each backend starts a commit, it sets its byte to -> > START_LOG_WRITE. -> ^^^^^^^^^^^^^^^^^^^^^^^ -> Isn't START_COMMIT more meaningful? - -Yes. - -> -> > When it gets ready to fsync, it checks all backends. -> ^^^^^^^^^^^^^^^^^^^^^^^^^^ -> What do you mean by this? The moment just after XLogInsert? - -Just before it calls fsync(). - -> -> > If all are NOT_IN_COMMIT, it does fsync and continues. -> -> 1st edition: -> > If one or more are in START_LOG_WRITE, it waits until no one is in -> > START_LOG_WRITE. It then checks all WAIT_ON_FSYNC, and if it is the -> > lowest backend in WAIT_ON_FSYNC, marks all others with its backend -> > number, and does fsync. It then clears all backends with its number to -> > NOT_IN_COMMIT. Other backend will see they are not the lowest -> > WAIT_ON_FSYNC and will wait for their byte to be set to NOT_IN_COMMIT -> > so they can then continue, knowing their data was synced. -> -> 2nd edition: -> > I have another idea. If a backend gets to the point that it needs -> > fsync, and there is another backend in START_LOG_WRITE, it can go to an -> > interuptable sleep, knowing another backend will perform the fsync and -> > wake it up. Therefore, there is no busy-wait or timed sleep. -> > -> > Of course, a backend must set its status to WAIT_ON_FSYNC to avoid a -> > race condition. -> -> The 2nd edition is much better. 
But I'm not sure do we really need in -> these per-backend bytes in shmem. Why not just have some counters? -> We can use a semaphore to wake-up all waiters at once. - -Yes, that is much better and clearer. My idea was just to say, "if no -one is entering commit phase, do the commit. If someone else is coming, -sleep and wait for them to do the fsync and wake me up with a singal." - -> -> > This allows a single backend not to sleep, and allows multiple backends -> > to bunch up only when they are all about to commit. -> > -> > The reason backend numbers are written is so other backends entering the -> > commit code will not interfere with the backends performing fsync. -> -> Being waked-up backend can check what's written/fsynced by calling XLogFlush. - -Seems that may not be needed anymore with a counter. The only issue is -that other backends may enter commit while fsync() is happening. The -process that did the fsync must be sure to wake up only the backends -that were waiting for it, and not other backends that may be also be -doing fsync as a group while the first fsync was happening. I leave -those details to people more experienced. :-) - -I am just glad people liked my idea. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - diff --git a/doc/TODO.detail/optimizer b/doc/TODO.detail/optimizer deleted file mode 100644 index fd6324f367..0000000000 --- a/doc/TODO.detail/optimizer +++ /dev/null @@ -1,2406 +0,0 @@ -From owner-pgsql-hackers@hub.org Mon Mar 22 18:43:41 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA23978 - for ; Mon, 22 Mar 1999 18:43:39 -0500 (EST) -Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id SAA06472 for ; Mon, 22 Mar 1999 18:36:44 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.2/8.9.1) with SMTP id SAA92604; - Mon, 22 Mar 1999 18:34:23 -0500 (EST) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 18:33:50 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.2/8.9.1) id SAA92469 - for pgsql-hackers-outgoing; Mon, 22 Mar 1999 18:33:47 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from po8.andrew.cmu.edu (PO8.ANDREW.CMU.EDU [128.2.10.108]) - by hub.org (8.9.2/8.9.1) with ESMTP id SAA92456 - for ; Mon, 22 Mar 1999 18:33:41 -0500 (EST) - (envelope-from er1p+@andrew.cmu.edu) -Received: (from postman@localhost) by po8.andrew.cmu.edu (8.8.5/8.8.2) id SAA12894 for pgsql-hackers@postgresql.org; Mon, 22 Mar 1999 18:33:38 -0500 (EST) -Received: via switchmail; Mon, 22 Mar 1999 18:33:38 -0500 (EST) -Received: from cloudy.me.cmu.edu via qmail - ID ; - Mon, 22 Mar 1999 18:27:20 -0500 (EST) -Received: from cloudy.me.cmu.edu via qmail - ID ; - Mon, 22 Mar 1999 18:27:17 -0500 (EST) -Received: from mms.4.60.Jun.27.1996.03.05.56.sun4.41.EzMail.2.0.CUILIB.3.45.SNAP.NOT.LINKED.cloudy.me.cmu.edu.sun4m.412 - via MS.5.6.cloudy.me.cmu.edu.sun4_41; - Mon, 22 Mar 1999 18:27:15 -0500 (EST) -Message-ID: -Date: Mon, 22 Mar 1999 18:27:15 -0500 (EST) -From: Erik Riedel -To: 
pgsql-hackers@postgreSQL.org -Subject: [HACKERS] optimizer and type question -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: RO - - -[last week aggregation, this week, the optimizer] - -I have a somewhat general optimizer question/problem that I would like -to get some input on - i.e. I'd like to know what is "supposed" to -work here and what I should be expecting. Sadly, I think the patch -for this is more involved than my last message. - -Using my favorite table these days: - -Table = lineitem -+------------------------+----------------------------------+-------+ -| Field | Type | Length| -+------------------------+----------------------------------+-------+ -| l_orderkey | int4 not null | 4 | -| l_partkey | int4 not null | 4 | -| l_suppkey | int4 not null | 4 | -| l_linenumber | int4 not null | 4 | -| l_quantity | float4 not null | 4 | -| l_extendedprice | float4 not null | 4 | -| l_discount | float4 not null | 4 | -| l_tax | float4 not null | 4 | -| l_returnflag | char() not null | 1 | -| l_linestatus | char() not null | 1 | -| l_shipdate | date | 4 | -| l_commitdate | date | 4 | -| l_receiptdate | date | 4 | -| l_shipinstruct | char() not null | 25 | -| l_shipmode | char() not null | 10 | -| l_comment | char() not null | 44 | -+------------------------+----------------------------------+-------+ -Index: lineitem_index_ - -and the query: - --- --- Query 1 --- -explain select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, -sum(l_extendedprice) as sum_base_price, -sum(l_extendedprice*(1-l_discount)) as sum_disc_price, -sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge, -avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, -avg(l_discount) as avg_disc, count(*) as count_order -from lineitem -where l_shipdate <= '1998-09-02'::date -group by l_returnflag, l_linestatus -order by l_returnflag, l_linestatus; - - -note that I have eliminated the date calculation in my query of last -week and manually replaced it with a 
constant (since this wasn't -happening automatically - but let's not worry about that for now). -And this is only an explain, we care about the optimizer. So we get: - -Sort (cost=34467.88 size=0 width=0) - -> Aggregate (cost=34467.88 size=0 width=0) - -> Group (cost=34467.88 size=0 width=0) - -> Sort (cost=34467.88 size=0 width=0) - -> Seq Scan on lineitem (cost=34467.88 size=200191 width=44) - -so let's think about the selectivity that is being chosen for the -seq scan (the where l_shipdate <= '1998-09-02'). - -Turns out the optimizer is choosing "33%", even though the real answer -is somewhere in 90+% (that's how the query is designed). So, why does -it do that? - -Turns out that selectivity in this case is determined via -plancat::restriction_selectivity() which calls into functionOID = 103 -(intltsel) for operatorOID = 1096 (date "<=") on relation OID = 18663 -(my lineitem). - -This all follows because of the description of 1096 (date "<=") in -pg_operator. Looking at local1_template1.bki.source near line 1754 -shows: - -insert OID = 1096 ( "<=" PGUID 0 <...> date_le intltsel intltjoinsel ) - -where we see that indeed, it thinks "intltsel" is the right function -to use for "oprrest" in the case of dates. - -Question 1 - is intltsel the right thing for selectivity on dates? - -Hope someone is still with me. - -So now we're running selfuncs::intltsel() where we make a further call -to selfuncs::gethilokey(). The job of gethilokey is to determine the -min and max values of a particular attribute in the table, which will -then be used with the constant in my where clause to estimate the -selectivity. It is going to search the pg_statistic relation with -three key values: - -Anum_pg_statistic_starelid 18663 (lineitem) -Anum_pg_statistic_staattnum 11 (l_shipdate) -Anum_pg_statistic_staop 1096 (date "<=") - -this finds no tuples in pg_statistic. Why is that? 
The only nearby -tuple in pg_statistic is: - -starelid|staattnum|staop|stalokey |stahikey ---------+---------+-----+----------------+---------------- - 18663| 11| 0|01-02-1992 |12-01-1998 - -and the reason the query doesn't match anything? Because 1096 != 0. -But why is it 0 in pg_statistic? Statistics are determined near line -1844 in vacuum.c (assuming a 'vacuum analyze' run at some point) - - i = 0; - values[i++] = (Datum) relid; /* 1 */ - values[i++] = (Datum) attp->attnum; /* 2 */ -====> values[i++] = (Datum) InvalidOid; /* 3 */ - fmgr_info(stats->outfunc, &out_function); - out_string = <...min...> - values[i++] = (Datum) fmgr(F_TEXTIN, out_string); - pfree(out_string); - out_string = <...max...> - values[i++] = (Datum) fmgr(F_TEXTIN, out_string); - pfree(out_string); - stup = heap_formtuple(sd->rd_att, values, nulls); - -the "offending" line is setting the staop to InvalidOid (i.e. 0). - -Question 2 - is this right? Is the intent for 0 to serve as a -"wildcard", or should it be inserting an entry for each operation -individually? - -In the case of "wildcard" then gethilokey() should allow a match for - -Anum_pg_statistic_staop 0 - -instead of requiring the more restrictive 1096. In the current code, -what happens next is gethilokey() returns "not found" and intltsel() -returns the default 1/3 which I see in the resultant query plan (size -= 200191 is 1/3 of the number of lineitem tuples). - -Question 3 - is there any inherent reason it couldn't get this right? -The statistic is in the table 1992 to 1998, so the '1998-09-02' date -should be 90-some% selectivity, a much better guess than 33%. 
-
-Doesn't make a difference for this particular query, of course,
-because the seq scan must proceed anyhow, but it could easily affect
-other queries where selectivities matter (and it affects the
-modifications I am trying to test in the optimizer to be "smarter"
-about selectivities - my overall context is to understand/improve the
-behavior that the underlying storage system sees from queries like this).
-
-OK, so let's say we treat 0 as a "wildcard" and stop checking for
-1096.  Now we let gethilokey() return the two dates from the statistic
-table.  The immediate next thing that intltsel() does, near lines 122
-in selfuncs.c, is call atol() on the strings from gethilokey().  And
-guess what it comes up with?
-
-low = 1
-high = 12
-
-because it calls atol() on '01-02-1992' and '12-01-1998'.  This
-clearly isn't right; it should get some large integer that includes
-the year and day in the result.  Then it should compare reasonably
-with my constant from the where clause and give a decent selectivity
-value.  This leads to a re-visit of Question 1.
-
-Question 4 - should date "<=" use a dateltsel() function instead of
-intltsel() as oprrest?
-
-If anyone is still with me, could you tell me if this makes sense, or
-if there is some other location where the appropriate type conversion
-could take place so that intltsel() gets something reasonable when it
-does the atol() calls?
-
-Could someone also give me a sense for how far out-of-whack the whole
-current selectivity-handling structure is?  It seems that most of the
-operators in pg_operator actually use intltsel() and would have
-type-specific problems like that described.  Or is the problem in the
-way attribute values are stored in pg_statistic by vacuum analyze?  Or
-is there another layer where type conversion belongs?
-
-Phew.  Enough typing, hope someone can follow this and address at
-least some of the questions.
-
-Thanks.
- -Erik Riedel -Carnegie Mellon University -www.cs.cmu.edu/~riedel - - - -From owner-pgsql-hackers@hub.org Mon Mar 22 20:31:11 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00802 - for ; Mon, 22 Mar 1999 20:31:09 -0500 (EST) -Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id UAA13231 for ; Mon, 22 Mar 1999 20:15:20 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.2/8.9.1) with SMTP id UAA01981; - Mon, 22 Mar 1999 20:14:04 -0500 (EST) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 20:13:32 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.2/8.9.1) id UAA01835 - for pgsql-hackers-outgoing; Mon, 22 Mar 1999 20:13:28 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) - by hub.org (8.9.2/8.9.1) with ESMTP id UAA01822 - for ; Mon, 22 Mar 1999 20:13:21 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id UAA23294; - Mon, 22 Mar 1999 20:12:43 -0500 (EST) -To: Erik Riedel -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] optimizer and type question -In-reply-to: Your message of Mon, 22 Mar 1999 18:27:15 -0500 (EST) - -Date: Mon, 22 Mar 1999 20:12:43 -0500 -Message-ID: <23292.922151563@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: ROr - -Erik Riedel writes: -> [ optimizer doesn't find relevant pg_statistic entry ] - -It's clearly a bug that the selectivity code is not finding this tuple. -If your analysis is correct, then selectivity estimation has *never* -worked properly, or at least not in recent memory :-(. Yipes. 
-Bruce and I found a bunch of other problems in the optimizer recently, -so it doesn't faze me to assume that this is broken too. - -> the "offending" line is setting the staop to InvalidOid (i.e. 0). -> Question 2 - is this right? Is the intent for 0 to serve as a -> "wildcard", - -My thought is that what the staop column ought to be is the OID of the -comparison function that was used to determine the sort order of the -column. Without a sort op the lowest and highest keys in the column are -not well defined, so it makes no sense to assert "these are the lowest -and highest values" without providing the sort op that determined that. -(For sufficiently complex data types one could reasonably have multiple -ordering operators. A crude example is sorting on "circumference" and -"area" for polygons.) But typically the sort op will be the "<" -operator for the column data type. - -So, the vacuum code is definitely broken --- it's not storing the sort -op that it used. The code in gethilokey might be broken too, depending -on how it is producing the operator it's trying to match against the -tuple. For example, if the actual operator in the query is any of -< <= > >= on int4, then int4lt ought to be used to probe the pg_statistic -table. I'm not sure if we have adequate info in pg_operator or pg_type -to let the optimizer code determine the right thing to probe with :-( - -> The immediate next thing that intltsel() does, near lines 122 -> in selfuncs.c is call atol() on the strings from gethilokey(). And -> guess what it comes up with? -> low = 1 -> high = 12 -> because it calls atol() on '01-02-1992' and '12-01-1998'. This -> clearly isn't right, it should get some large integer that includes -> the year and day in the result. Then it should compare reasonably -> with my constant from the where clause and give a decent selectivity -> value. This leads to a re-visit of Question 1. 
-> Question 4 - should date "<=" use a dateltsel() function instead of -> intltsel() as oprrest? - -This is clearly busted as well. I'm not sure that creating dateltsel() -is the right fix, however, because if you go down that path then every -single datatype needs its own selectivity function; that's more than we -need. - -What we really want here is to be able to map datatype values into -some sort of numeric range so that we can compute what fraction of the -low-key-to-high-key range is on each side of the probe value (the -constant taken from the query). This general concept will apply to -many scalar types, so what we want is a type-specific mapping function -and a less-specific fraction-computing-function. Offhand I'd say that -we want intltsel() and floatltsel(), plus conversion routines that can -produce either int4 or float8 from a data type as seems appropriate. -Anything that couldn't map to one or the other would have to supply its -own selectivity function. - -> Or is the problem in the -> way attribute values are stored in pg_statistic by vacuum analyze? - -Looks like it converts the low and high values to text and stores them -that way. Ugly as can be :-( but I'm not sure there is a good -alternative. We have no "wild card" column type AFAIK, which is what -these columns of pg_statistic would have to be to allow storage of -unconverted min and max values. - -I think you've found a can of worms here. 
Congratulations ;-) - - regards, tom lane - - -From owner-pgsql-hackers@hub.org Mon Mar 22 23:31:00 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA03384 - for ; Mon, 22 Mar 1999 23:30:58 -0500 (EST) -Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id XAA25586 for ; Mon, 22 Mar 1999 23:18:25 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.2/8.9.1) with SMTP id XAA17955; - Mon, 22 Mar 1999 23:17:24 -0500 (EST) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 22 Mar 1999 23:16:49 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.2/8.9.1) id XAA17764 - for pgsql-hackers-outgoing; Mon, 22 Mar 1999 23:16:46 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from po8.andrew.cmu.edu (PO8.ANDREW.CMU.EDU [128.2.10.108]) - by hub.org (8.9.2/8.9.1) with ESMTP id XAA17745 - for ; Mon, 22 Mar 1999 23:16:39 -0500 (EST) - (envelope-from er1p+@andrew.cmu.edu) -Received: (from postman@localhost) by po8.andrew.cmu.edu (8.8.5/8.8.2) id XAA04273; Mon, 22 Mar 1999 23:16:37 -0500 (EST) -Received: via switchmail; Mon, 22 Mar 1999 23:16:37 -0500 (EST) -Received: from hazy.adsl.net.cmu.edu via qmail - ID ; - Mon, 22 Mar 1999 23:15:09 -0500 (EST) -Received: from hazy.adsl.net.cmu.edu via qmail - ID ; - Mon, 22 Mar 1999 23:15:00 -0500 (EST) -Received: from mms.4.60.Jun.27.1996.03.02.53.sun4.51.EzMail.2.0.CUILIB.3.45.SNAP.NOT.LINKED.hazy.adsl.net.cmu.edu.sun4m.54 - via MS.5.6.hazy.adsl.net.cmu.edu.sun4_51; - Mon, 22 Mar 1999 23:14:55 -0500 (EST) -Message-ID: <4qxlJ0200anI01hK40@andrew.cmu.edu> -Date: Mon, 22 Mar 1999 23:14:55 -0500 (EST) -From: Erik Riedel -To: Tom Lane -Subject: Re: [HACKERS] optimizer and type question -Cc: pgsql-hackers@postgreSQL.org -In-Reply-To: <23292.922151563@sss.pgh.pa.us> -References: 
<23292.922151563@sss.pgh.pa.us> -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: ROr - - -OK, building on your high-level explanation, I am attaching a patch that -attempts to do something "better" than the current code. Note that I -have only tested this with the date type and my particular query. I -haven't run it through the regression, so consider it "proof of concept" -at best. Although hopefully it will serve my purposes. - -> My thought is that what the staop column ought to be is the OID of the -> comparison function that was used to determine the sort order of the -> column. Without a sort op the lowest and highest keys in the column are -> not well defined, so it makes no sense to assert "these are the lowest -> and highest values" without providing the sort op that determined that. -> -> (For sufficiently complex data types one could reasonably have multiple -> ordering operators. A crude example is sorting on "circumference" and -> "area" for polygons.) But typically the sort op will be the "<" -> operator for the column data type. -> -I changed vacuum.c to do exactly that. oid of the lt sort op. - -> So, the vacuum code is definitely broken --- it's not storing the sort -> op that it used. The code in gethilokey might be broken too, depending -> on how it is producing the operator it's trying to match against the -> tuple. For example, if the actual operator in the query is any of -> < <= > >= on int4, then int4lt ought to be used to probe the pg_statistic -> table. I'm not sure if we have adequate info in pg_operator or pg_type -> to let the optimizer code determine the right thing to probe with :-( -> -This indeed seems like a bigger problem. I thought about somehow using -type-matching from the sort op and the actual operator in the query - if -both the left and right type match, then consider them the same for -purposes of this probe. 
That seemed complicated, so I punted in my -example - it just does the search with relid and attnum and assumes that -only returns one tuple. This works in my case (maybe in all cases, -because of the way vacuum is currently written - ?). - -> What we really want here is to be able to map datatype values into -> some sort of numeric range so that we can compute what fraction of the -> low-key-to-high-key range is on each side of the probe value (the -> constant taken from the query). This general concept will apply to -> many scalar types, so what we want is a type-specific mapping function -> and a less-specific fraction-computing-function. Offhand I'd say that -> we want intltsel() and floatltsel(), plus conversion routines that can -> produce either int4 or float8 from a data type as seems appropriate. -> Anything that couldn't map to one or the other would have to supply its -> own selectivity function. -> -This is what my example then does. Uses the stored sort op to get the -type and then uses typinput to convert from the string to an int4. - -Then puts the int4 back into string format because that's what everyone -was expecting. - -It seems to work for my particular query. I now get: - -(selfuncs) gethilokey() obj 18663 attr 11 opid 1096 (ignored) -(selfuncs) gethilokey() found op 1087 in pg_proc -(selfuncs) gethilokey() found type 1082 in pg_type -(selfuncs) gethilokey() going to use 1084 to convert type 1082 -(selfuncs) gethilokey() have low -2921 high -396 -(selfuncs) intltsel() high -396 low -2921 val -486 -(plancat) restriction_selectivity() for func 103 op 1096 rel 18663 attr -11 const -486 flag 3 returns 0.964356 -NOTICE: QUERY PLAN: - -Sort (cost=34467.88 size=0 width=0) - -> Aggregate (cost=34467.88 size=0 width=0) - -> Group (cost=34467.88 size=0 width=0) - -> Sort (cost=34467.88 size=0 width=0) - -> Seq Scan on lineitem (cost=34467.88 size=579166 width=44) - -including my printfs, which exist in the patch as well. 
- -Selectivity is now the expected 96% and the size estimate for the seq -scan is much closer to correct. - -Again, not tested with anything besides date, so caveat not-tested. - -Hope this helps. - -Erik - -----------------------[optimizer_fix.sh]------------------------ - -#! /bin/sh -# This is a shell archive, meaning: -# 1. Remove everything above the #! /bin/sh line. -# 2. Save the resulting text in a file. -# 3. Execute the file with /bin/sh (not csh) to create: -# selfuncs.c.diff -# vacuum.c.diff -# This archive created: Mon Mar 22 22:58:14 1999 -export PATH; PATH=/bin:/usr/bin:$PATH -if test -f 'selfuncs.c.diff' -then - echo shar: "will not over-write existing file 'selfuncs.c.diff'" -else -cat << \SHAR_EOF > 'selfuncs.c.diff' -*** -/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/611/src/backend/utils/adt -/selfuncs.c Thu Mar 11 23:59:35 1999 ---- -/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/615/src/backend/utils/adt -/selfuncs.c Mon Mar 22 22:57:25 1999 -*************** -*** 32,37 **** ---- 32,40 ---- - #include "utils/lsyscache.h" /* for get_oprrest() */ - #include "catalog/pg_statistic.h" - -+ #include "catalog/pg_proc.h" /* for Form_pg_proc */ -+ #include "catalog/pg_type.h" /* for Form_pg_type */ -+ - /* N is not a valid var/constant or relation id */ - #define NONVALUE(N) ((N) == -1) - -*************** -*** 103,110 **** - bottom; - - result = (float64) palloc(sizeof(float64data)); -! if (NONVALUE(attno) || NONVALUE(relid)) - *result = 1.0 / 3; - else - { - /* XXX val = atol(value); */ ---- 106,114 ---- - bottom; - - result = (float64) palloc(sizeof(float64data)); -! if (NONVALUE(attno) || NONVALUE(relid)) { - *result = 1.0 / 3; -+ } - else - { - /* XXX val = atol(value); */ -*************** -*** 117,130 **** - } - high = atol(highchar); - low = atol(lowchar); - if ((flag & SEL_RIGHT && val < low) || - (!(flag & SEL_RIGHT) && val > high)) - { - float32data nvals; - - nvals = getattdispersion(relid, (int) attno); -! 
if (nvals == 0) - *result = 1.0 / 3.0; - else - { - *result = 3.0 * (float64data) nvals; ---- 121,136 ---- - } - high = atol(highchar); - low = atol(lowchar); -+ printf("(selfuncs) intltsel() high %d low %d val %d\n",high,low,val); - if ((flag & SEL_RIGHT && val < low) || - (!(flag & SEL_RIGHT) && val > high)) - { - float32data nvals; - - nvals = getattdispersion(relid, (int) attno); -! if (nvals == 0) { - *result = 1.0 / 3.0; -+ } - else - { - *result = 3.0 * (float64data) nvals; -*************** -*** 336,341 **** ---- 342,353 ---- - { - Relation rel; - HeapScanDesc scan; -+ /* this assumes there is only one row in the statistics table for any -particular */ -+ /* relid, attnum pair - could be more complicated if staop is also -used. */ -+ /* at the moment, if there are multiple rows, this code ends up -picking the */ -+ /* "first" one - - er1p */ -+ /* the actual "ignoring" is done in the call to heap_beginscan() -below, where */ -+ /* we only mention 2 of the 3 keys in this array - - er1p */ - static ScanKeyData key[3] = { - {0, Anum_pg_statistic_starelid, F_OIDEQ, {0, 0, F_OIDEQ}}, - {0, Anum_pg_statistic_staattnum, F_INT2EQ, {0, 0, F_INT2EQ}}, -*************** -*** 344,355 **** - bool isnull; - HeapTuple tuple; - - rel = heap_openr(StatisticRelationName); - - key[0].sk_argument = ObjectIdGetDatum(relid); - key[1].sk_argument = Int16GetDatum((int16) attnum); - key[2].sk_argument = ObjectIdGetDatum(opid); -! scan = heap_beginscan(rel, 0, SnapshotNow, 3, key); - tuple = heap_getnext(scan, 0); - if (!HeapTupleIsValid(tuple)) - { ---- 356,377 ---- - bool isnull; - HeapTuple tuple; - -+ HeapTuple tup; -+ Form_pg_proc proc; -+ Form_pg_type typ; -+ Oid which_op; -+ Oid which_type; -+ int32 low_value; -+ int32 high_value; -+ - rel = heap_openr(StatisticRelationName); - - key[0].sk_argument = ObjectIdGetDatum(relid); - key[1].sk_argument = Int16GetDatum((int16) attnum); - key[2].sk_argument = ObjectIdGetDatum(opid); -! 
printf("(selfuncs) gethilokey() obj %d attr %d opid %d (ignored)\n", -! key[0].sk_argument,key[1].sk_argument,key[2].sk_argument); -! scan = heap_beginscan(rel, 0, SnapshotNow, 2, key); - tuple = heap_getnext(scan, 0); - if (!HeapTupleIsValid(tuple)) - { -*************** -*** 376,383 **** ---- 398,461 ---- - &isnull)); - if (isnull) - elog(DEBUG, "gethilokey: low key is null"); -+ - heap_endscan(scan); - heap_close(rel); -+ -+ /* now we deal with type conversion issues - */ -+ /* when intltsel() calls this routine (who knows what other callers -might do) */ -+ /* it assumes that it can call atol() on the strings and then use -integer */ -+ /* comparison from there. what we are going to do here, then, is try -to use */ -+ /* the type information from Anum_pg_statistic_staop to convert the -high */ -+ /* and low values -- er1p */ -+ -+ /* WARNING: this code has only been tested with the date type and has -NOT */ -+ /* been regression tested. consider it "sample" code of what might -be the */ -+ /* right kind of thing to do -- er1p */ -+ -+ /* get the 'op' from pg_statistic and look it up in pg_proc */ -+ which_op = heap_getattr(tuple, -+ Anum_pg_statistic_staop, -+ RelationGetDescr(rel), -+ &isnull); -+ if (InvalidOid == which_op) { -+ /* ignore all this stuff, try conversion only if we have a valid staop */ -+ /* note that there is an accompanying change to 'vacuum analyze' that */ -+ /* gets this set to something useful. */ -+ } else { -+ /* staop looks valid, so let's see what we can do about conversion */ -+ tup = SearchSysCacheTuple(PROOID, ObjectIdGetDatum(which_op), 0, 0, 0); -+ if (!HeapTupleIsValid(tup)) { -+ elog(ERROR, "selfuncs: unable to find op in pg_proc %d", which_op); -+ } -+ printf("(selfuncs) gethilokey() found op %d in pg_proc\n",which_op); -+ -+ /* use that to determine the type of stahikey and stalokey via pg_type */ -+ proc = (Form_pg_proc) GETSTRUCT(tup); -+ which_type = proc->proargtypes[0]; /* XXX - use left and right -separately? 
*/ -+ tup = SearchSysCacheTuple(TYPOID, ObjectIdGetDatum(which_type), 0, 0, 0); -+ if (!HeapTupleIsValid(tup)) { -+ elog(ERROR, "selfuncs: unable to find type in pg_type %d", which_type); -+ } -+ printf("(selfuncs) gethilokey() found type %d in pg_type\n",which_type); -+ -+ /* and use that type to get the conversion function to int4 */ -+ typ = (Form_pg_type) GETSTRUCT(tup); -+ printf("(selfuncs) gethilokey() going to use %d to convert type -%d\n",typ->typinput,which_type); -+ -+ /* and convert the low and high strings */ -+ low_value = (int32) fmgr(typ->typinput, *low, -1); -+ high_value = (int32) fmgr(typ->typinput, *high, -1); -+ printf("(selfuncs) gethilokey() have low %d high -%d\n",low_value,high_value); -+ -+ /* now we have int4's, which we put back into strings because -that's what out */ -+ /* callers (intltsel() at least) expect - - er1p */ -+ pfree(*low); pfree(*high); /* let's not leak the old strings */ -+ *low = int4out(low_value); -+ *high = int4out(high_value); -+ -+ /* XXX - this probably leaks the two tups we got from -SearchSysCacheTuple() - er1p */ -+ } - } - - float64 -SHAR_EOF -fi -if test -f 'vacuum.c.diff' -then - echo shar: "will not over-write existing file 'vacuum.c.diff'" -else -cat << \SHAR_EOF > 'vacuum.c.diff' -*** -/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/611/src/backend/commands/ -vacuum.c Thu Mar 11 23:59:09 1999 ---- -/afs/ece.cmu.edu/project/lcs/lcs-004/er1p/postgres/615/src/backend/commands/ -vacuum.c Mon Mar 22 21:23:15 1999 -*************** -*** 1842,1848 **** - i = 0; - values[i++] = (Datum) relid; /* 1 */ - values[i++] = (Datum) attp->attnum; /* 2 */ -! values[i++] = (Datum) InvalidOid; /* 3 */ - fmgr_info(stats->outfunc, &out_function); - out_string = (*fmgr_faddr(&out_function)) (stats->min, -stats->attr->atttypid); - values[i++] = (Datum) fmgr(F_TEXTIN, out_string); ---- 1842,1848 ---- - i = 0; - values[i++] = (Datum) relid; /* 1 */ - values[i++] = (Datum) attp->attnum; /* 2 */ -! 
values[i++] = (Datum) stats->f_cmplt.fn_oid; /* 3 */ /* get the -'<' oid, instead of 'invalid' - er1p */ - fmgr_info(stats->outfunc, &out_function); - out_string = (*fmgr_faddr(&out_function)) (stats->min, -stats->attr->atttypid); - values[i++] = (Datum) fmgr(F_TEXTIN, out_string); -SHAR_EOF -fi -exit 0 -# End of shell archive - - - -From owner-pgsql-hackers@hub.org Tue Mar 23 12:31:05 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17491 - for ; Tue, 23 Mar 1999 12:31:04 -0500 (EST) -Received: from hub.org (majordom@hub.org [209.47.145.100]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA08839 for ; Tue, 23 Mar 1999 12:08:14 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.2/8.9.1) with SMTP id MAA93649; - Tue, 23 Mar 1999 12:04:57 -0500 (EST) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Tue, 23 Mar 1999 12:03:00 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.2/8.9.1) id MAA93355 - for pgsql-hackers-outgoing; Tue, 23 Mar 1999 12:02:55 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) - by hub.org (8.9.2/8.9.1) with ESMTP id MAA93336 - for ; Tue, 23 Mar 1999 12:02:43 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24455; - Tue, 23 Mar 1999 12:01:57 -0500 (EST) -To: Erik Riedel -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] optimizer and type question -In-reply-to: Your message of Mon, 22 Mar 1999 23:14:55 -0500 (EST) - <4qxlJ0200anI01hK40@andrew.cmu.edu> -Date: Tue, 23 Mar 1999 12:01:57 -0500 -Message-ID: <24453.922208517@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: RO - -Erik Riedel writes: -> OK, building 
on your high-level explanation, I am attaching a patch that -> attempts to do something "better" than the current code. Note that I -> have only tested this with the date type and my particular query. - -Glad to see you working on this. I don't like the details of your -patch too much though ;-). Here are some suggestions for making it -better. - -1. I think just removing staop from the lookup in gethilokey is OK for -now, though I'm dubious about Bruce's thought that we could delete that -field entirely. As you observe, vacuum will not currently put more -than one tuple for a column into pg_statistic, so we can just do the -lookup with relid and attno and leave it at that. But I think we ought -to leave the field there, with the idea that vacuum might someday -compute more than one statistic for a data column. Fixing vacuum to -put its sort op into the field is a good idea in the meantime. - -2. The type conversion you're doing in gethilokey is a mess; I think -what you ought to make it do is simply the inbound conversion of the -string from pg_statistic into the internal representation for the -column's datatype, and return that value as a Datum. It also needs -a cleaner success/failure return convention --- this business with -"n" return is ridiculously type-specific. Also, the best and easiest -way to find the type to convert to is to look up the column type in -the info for the given relid, not search pg_proc with the staop value. -(I'm not sure that will even work, since there are pg_proc entries -with wildcard argument types.) - -3. The atol() calls currently found in intltsel are a type-specific -cheat on what is conceptually a two-step process: - * Convert the string stored in pg_statistic back to the internal - form for the column data type. - * Generate a numeric representation of the data value that can be - used as an estimate of the range of values in the table. 
-The second step is trivial for integers, which may obscure the fact -that there are two steps involved, but nonetheless there are. If -you think about applying selectivity logic to strings, say, it -becomes clear that the second step is a necessary component of the -process. Furthermore, the second step must also be applied to the -probe value that's being passed into the selectivity operator. -(The probe value is already in internal form, of course; but it is -not necessarily in a useful numeric form.) - -We can do the first of these steps by applying the appropriate "XXXin" -conversion function for the column data type, as you have done. The -interesting question is how to do the second one. A really clean -solution would require adding a column to pg_type that points to a -function that will do the appropriate conversion. I'd be inclined to -make all of these functions return "double" (float8) and just have one -top-level selectivity routine for all data types that can use -range-based selectivity logic. - -We could probably hack something together that would not use an explicit -conversion function for each data type, but instead would rely on -type-specific assumptions inside the selectivity routines. We'd need many -more selectivity routines though (at least one for each of int, float4, -float8, and text data types) so I'm not sure we'd really save any work -compared to doing it right. - -BTW, now that I look at this issue it's real clear that the selectivity -entries in pg_operator are horribly broken. 
The intltsel/intgtsel -selectivity routines are currently applied to 32 distinct data types: - -regression=> select distinct typname,oprleft from pg_operator, pg_type -regression-> where pg_type.oid = oprleft -regression-> and oprrest in (103,104); -typname |oprleft ----------+------- -_aclitem | 1034 -abstime | 702 -bool | 16 -box | 603 -bpchar | 1042 -char | 18 -cidr | 650 -circle | 718 -date | 1082 -datetime | 1184 -float4 | 700 -float8 | 701 -inet | 869 -int2 | 21 -int4 | 23 -int8 | 20 -line | 628 -lseg | 601 -macaddr | 829 -money | 790 -name | 19 -numeric | 1700 -oid | 26 -oid8 | 30 -path | 602 -point | 600 -polygon | 604 -text | 25 -time | 1083 -timespan | 1186 -timestamp| 1296 -varchar | 1043 -(32 rows) - -many of which are very obviously not compatible with integer for *any* -purpose. It looks to me like a lot of data types were added to -pg_operator just by copy-and-paste, without paying attention to whether -the selectivity routines were actually correct for the data type. - -As the code stands today, the bogus entries don't matter because -gethilokey always fails, so we always get 1/3 as the selectivity -estimate for any comparison operator (except = and != of course). -I had actually noticed that fact and assumed that it was supposed -to work that way :-(. But, clearly, there is code in here that -is *trying* to be smarter. - -As soon as we fix gethilokey so that it can succeed, we will start -getting essentially-random selectivity estimates for those data types -that aren't actually binary-compatible with integer. That will not do; -we have to do something about the issue. 
- - regards, tom lane - - -From tgl@sss.pgh.pa.us Tue Mar 23 12:31:02 1999 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA17484 - for ; Tue, 23 Mar 1999 12:31:01 -0500 (EST) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by renoir.op.net (o1/$ Revision: 1.18 $) with ESMTP id MAA09042 for ; Tue, 23 Mar 1999 12:10:55 -0500 (EST) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.9.1/8.9.1) with ESMTP id MAA24474; - Tue, 23 Mar 1999 12:09:52 -0500 (EST) -To: Bruce Momjian -cc: riedel+@CMU.EDU, pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] optimizer and type question -In-reply-to: Your message of Mon, 22 Mar 1999 21:25:45 -0500 (EST) - <199903230225.VAA01641@candle.pha.pa.us> -Date: Tue, 23 Mar 1999 12:09:52 -0500 -Message-ID: <24471.922208992@sss.pgh.pa.us> -From: Tom Lane -Status: RO - -Bruce Momjian writes: -> What we really need is some way to determine how far the requested value -> is from the min/max values. With int, we just do (val-min)/(max-min). -> That works, but how do we do that for types that don't support division. -> Strings come to mind in this case. - -What I'm envisioning is that we still apply the (val-min)/(max-min) -logic, but apply it to numeric values that are produced in a -type-dependent way. - -For ints and floats the conversion is trivial, of course. - -For strings, the first thing that comes to mind is to return 0 for a -null string and the value of the first byte for a non-null string. -This would give you one-part-in-256 selectivity which is plenty good -enough for what the selectivity code needs to do. (Actually, it's -only that good if the strings' first bytes are pretty well spread out. -If you have a table containing English words, for example, you might -only get about one part in 26 this way, since the first bytes will -probably only run from A to Z. 
Might be better to use the first two -characters of the string to compute the selectivity representation.) - -In general, you can apply this logic as long as you can come up with -some numerical approximation to the data type's sorting order. It -doesn't have to be exact.
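
The mapping Tom describes can be sketched as follows. This is purely an illustration in Python with invented helper names (PostgreSQL's real planner code is C); it maps a string to a number in [0, 1) from its first bytes, then applies the (val-min)/(max-min) rule:

```python
def string_to_scalar(s: str, width: int = 2) -> float:
    """Map a string to [0, 1) using its first `width` bytes, roughly
    preserving sort order (the first-two-characters idea above)."""
    val = 0.0
    for i, b in enumerate(s.encode("latin-1", "replace")[:width]):
        val += b / (256 ** (i + 1))
    return val

def range_selectivity(val: str, lo: str, hi: str) -> float:
    """Estimate the fraction of the range [lo, hi] lying below `val`:
    the (val-min)/(max-min) rule, clamped to [0, 1]."""
    v, l, h = (string_to_scalar(x) for x in (val, lo, hi))
    if h <= l:
        return 0.5  # degenerate range: fall back to a default guess
    return max(0.0, min(1.0, (v - l) / (h - l)))
```

As the message notes, this only needs to be a rough numerical approximation of the type's sort order, not an exact one.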
- - regards, tom lane - - -From owner-pgsql-hackers@hub.org Thu Jul 1 20:39:19 1999 -Received: from hub.org (hub.org [209.167.229.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA15403 - for ; Thu, 1 Jul 1999 20:39:18 -0400 (EDT) -Received: from hub.org (hub.org [209.167.229.1]) - by hub.org (8.9.3/8.9.3) with ESMTP id UAA45018; - Thu, 1 Jul 1999 20:20:27 -0400 (EDT) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Thu, 01 Jul 1999 20:15:30 +0000 (EDT) -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id UAA44474 - for pgsql-hackers-outgoing; Thu, 1 Jul 1999 20:15:28 -0400 (EDT) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -X-Authentication-Warning: hub.org: majordom set sender to owner-pgsql-hackers@postgreSQL.org using -f -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by hub.org (8.9.3/8.9.3) with ESMTP id UAA44058 - for ; Thu, 1 Jul 1999 20:12:10 -0400 (EDT) - (envelope-from Inoue@tpf.co.jp) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id JAA00279 for ; Fri, 02 Jul 1999 09:11:58 +0900 -From: "Hiroshi Inoue" -To: "pgsql-hackers" -Subject: [HACKERS] Optimization FAQ ? -Date: Fri, 2 Jul 1999 09:14:10 +0900 -Message-ID: <000401bec41f$ce81dcc0$2801007e@cadzone.tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-2022-jp" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4 -Importance: Normal -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: RO - -Hello all, - -I got the following result. -It's FAQ ? 
- -drop table int2t; -create table int2t (id int2 primary key); - -explain select * from int2t where id=1; - NOTICE: QUERY PLAN: - - Seq Scan on int2t (cost=43.00 rows=2 width=2) - -explain select * from int2t where id=1::int2; - NOTICE: QUERY PLAN: - - Index Scan using int2t_pkey on int2t (cost=2.05 rows=2 width=2) - -explain select * from int2t where id='1'; - NOTICE: QUERY PLAN: - - Index Scan using int2t_pkey on int2t (cost=2.05 rows=2 width=2) - -Right behavior ? - -Regards. - -Hiroshi Inoue -Inoue@tpf.co.jp - - -From owner-pgsql-hackers@hub.org Thu Jan 20 18:45:32 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA00672 - for ; Thu, 20 Jan 2000 19:45:30 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.20 $) with ESMTP id TAA01989 for ; Thu, 20 Jan 2000 19:39:15 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id TAA00957; - Thu, 20 Jan 2000 19:35:19 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 19:33:34 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id TAA00581 - for pgsql-hackers-outgoing; Thu, 20 Jan 2000 19:32:37 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by hub.org (8.9.3/8.9.3) with ESMTP id TAA98940 - for ; Thu, 20 Jan 2000 19:31:49 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id TAA25390 - for ; Thu, 20 Jan 2000 19:31:32 -0500 (EST) -To: pgsql-hackers@postgreSQL.org -Subject: [HACKERS] Some notes on optimizer cost estimates -Date: Thu, 20 Jan 2000 19:31:32 -0500 -Message-ID: <25387.948414692@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -I have been spending 
some time measuring actual runtimes for various -sequential-scan and index-scan query plans, and have learned that the -current Postgres optimizer's cost estimation equations are not very -close to reality at all. - -Presently we estimate the cost of a sequential scan as - - Nblocks + CPU_PAGE_WEIGHT * Ntuples - ---- that is, the unit of cost is the time to read one disk page, -and we have a "fudge factor" that relates CPU time per tuple to -disk time per page. (The default CPU_PAGE_WEIGHT is 0.033, which -is probably too high for modern hardware --- 0.01 seems like it -might be a better default, at least for simple queries.) OK, -it's a simplistic model, but not too unreasonable so far. - -The cost of an index scan is measured in these same terms as - - Nblocks + CPU_PAGE_WEIGHT * Ntuples + - CPU_INDEX_PAGE_WEIGHT * Nindextuples - -Here Ntuples is the number of tuples selected by the index qual -condition (typically, it's less than the total table size used in -sequential-scan estimation). CPU_INDEX_PAGE_WEIGHT essentially -estimates the cost of scanning an index tuple; by default it's 0.017 or -half CPU_PAGE_WEIGHT. Nblocks is estimated as the index size plus an -appropriate fraction of the main table size. - -There are two big problems with this: - -1. Since main-table tuples are visited in index order, we'll be hopping -around from page to page in the table. The current cost estimation -method essentially assumes that the buffer cache plus OS disk cache will -be 100% efficient --- we will never have to read the same page of the -main table twice in a scan, due to having discarded it between -references. This of course is unreasonably optimistic. Worst case -is that we'd fetch a main-table page for each selected tuple, but in -most cases that'd be unreasonably pessimistic. - -2. The cost of a disk page fetch is estimated at 1.0 unit for both -sequential and index scans. 
In reality, sequential access is *much* -cheaper than the quasi-random accesses performed by an index scan. -This is partly a matter of physical disk seeks, and partly a matter -of benefitting (or not) from any read-ahead logic the OS may employ. - -As best I can measure on my hardware, the cost of a nonsequential -disk read should be estimated at 4 to 5 times the cost of a sequential -one --- I'm getting numbers like 2.2 msec per disk page for sequential -scans, and as much as 11 msec per page for index scans. I don't -know, however, if this ratio is similar enough on other platforms -to be useful for cost estimating. We could make it a parameter like -we do for CPU_PAGE_WEIGHT ... but you know and I know that no one -ever bothers to adjust those numbers in the field ... - -The other effect that needs to be modeled, and currently is not, is the -"hit rate" of buffer cache. Presumably, this is 100% for tables smaller -than the cache and drops off as the table size increases --- but I have -no particular thoughts on the form of the dependency. Does anyone have -ideas here? The problem is complicated by the fact that we don't really -know how big the cache is; we know the number of buffers Postgres has, -but we have no idea how big a disk cache the kernel is keeping. As near -as I can tell, finding a hit in the kernel disk cache is not a lot more -expensive than having the page sitting in Postgres' own buffers --- -certainly it's much much cheaper than a disk read. - -BTW, if you want to do some measurements of your own, try turning on -PGOPTIONS="-d 2 -te". This will dump a lot of interesting numbers -into the postmaster log, if your platform supports getrusage(). 
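
The two estimation equations above, with the default weights quoted in the message, can be written out as a small sketch (Python only for illustration; the real estimates live in the planner's C code, and the constant names here mirror the message, not any actual API):

```python
# Defaults quoted in the message; cost unit = one sequential page read.
CPU_PAGE_WEIGHT = 0.033
CPU_INDEX_PAGE_WEIGHT = 0.017  # half of CPU_PAGE_WEIGHT

def seqscan_cost(nblocks: int, ntuples: int) -> float:
    """Nblocks + CPU_PAGE_WEIGHT * Ntuples."""
    return nblocks + CPU_PAGE_WEIGHT * ntuples

def indexscan_cost(nblocks: int, ntuples: int, nindextuples: int) -> float:
    """Nblocks + CPU_PAGE_WEIGHT * Ntuples
       + CPU_INDEX_PAGE_WEIGHT * Nindextuples,
    where Nblocks includes the index plus a fraction of the table."""
    return (nblocks + CPU_PAGE_WEIGHT * ntuples
            + CPU_INDEX_PAGE_WEIGHT * nindextuples)
```

Both formulas charge 1.0 per page fetched, which is exactly the assumption the message goes on to criticize.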
- - regards, tom lane - -************ - -From owner-pgsql-hackers@hub.org Thu Jan 20 20:26:33 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA06630 - for ; Thu, 20 Jan 2000 21:26:32 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id VAA35022; - Thu, 20 Jan 2000 21:22:08 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:20:35 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id VAA34569 - for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:19:38 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from hercules.cs.ucsb.edu (hercules.cs.ucsb.edu [128.111.41.30]) - by hub.org (8.9.3/8.9.3) with ESMTP id VAA34534 - for ; Thu, 20 Jan 2000 21:19:26 -0500 (EST) - (envelope-from xun@cs.ucsb.edu) -Received: from xp10-06.dialup.commserv.ucsb.edu (root@xp10-06.dialup.commserv.ucsb.edu [128.111.253.249]) - by hercules.cs.ucsb.edu (8.8.6/8.8.6) with ESMTP id SAA04655 - for ; Thu, 20 Jan 2000 18:19:22 -0800 (PST) -Received: from xp10-06.dialup.commserv.ucsb.edu (xun@localhost) - by xp10-06.dialup.commserv.ucsb.edu (8.9.3/8.9.3) with ESMTP id SAA22377 - for ; Thu, 20 Jan 2000 18:19:40 -0800 -Message-Id: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu> -To: pgsql-hackers@postgreSQL.org -Reply-to: xun@cs.ucsb.edu -Subject: Re. [HACKERS] Some notes on optimizer cost estimates -Date: Thu, 20 Jan 2000 18:19:40 -0800 -From: Xun Cheng -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -I'm very glad you bring up this cost estimate issue. -Recent work in database research have argued a more -detailed disk access cost model should be used for -large queries especially joins. -Traditional cost estimate only considers the number of -disk pages accessed. However a more detailed model -would consider three parameters: avg. seek, avg. latency -and avg. page transfer. 
For old disk, typical values are -SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms. -A sequential continuous reading of a table (assuming -1000 continuous pages) would cost -(SEEK+LATENCY+1000*TRANFER=2617.8ms); while quasi-randomly -reading 200 times with 2 continuous pages/time would -cost (SEEK+200*LATENCY+400*TRANSFER=2700ms). -Someone from IBM lab re-studied the traditional -ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost model -and found some interesting results. - ->I have been spending some time measuring actual runtimes for various ->sequential-scan and index-scan query plans, and have learned that the ->current Postgres optimizer's cost estimation equations are not very ->close to reality at all. - -One interesting question I'd like to ask is if this non-closeness -really affects the optimal choice of postgresql's query optimizer. -And to what degree the effects might be? My point is that -if the optimizer estimated the cost for sequential-scan is 10 and -the cost for index-scan is 20 while the actual costs are 10 vs. 40, -it should be ok because the optimizer would still choose sequential-scan -as it should. - ->1. Since main-table tuples are visited in index order, we'll be hopping ->around from page to page in the table. - -I'm not sure about the implementation in postgresql. One thing you might -be able to do is to first collect all must-read page addresses from -the index scan and then order them before the actual ordered page fetching. -It would at least avoid the same page being read twice (not entirely -true depending on the context (like in join) and algo.) - ->The current cost estimation ->method essentially assumes that the buffer cache plus OS disk cache will ->be 100% efficient --- we will never have to read the same page of the ->main table twice in a scan, due to having discarded it between ->references. This of course is unreasonably optimistic. 
Worst case ->is that we'd fetch a main-table page for each selected tuple, but in ->most cases that'd be unreasonably pessimistic. - -This is actually the motivation that I asked before if postgresql -has a raw disk facility. That way we have much control on this cache -issue. Of course only if we can provide some algo. better than OS -cache algo. (depending on the context, like large joins), a raw disk -facility will be worthwhile (besides the recoverability). - -Actually I have another question for you guys which is somehow related -to this cost estimation issue. You know the difference between OLTP -and OLAP. My question is how you target postgresql on both kinds -of applications or just OLTP. From what I know OLTP and OLAP would -have a big difference in query characteristics and thus -optimization difference. If postgresql is only targeted on -OLTP, the above cost estimation issue might not be that -important. However for OLAP, large tables and large queries are -common and optimization would be difficult. 
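
Plugging Xun's sample figures into the three-parameter model reproduces the numbers quoted above (the 2700 ms figure is evidently rounded; the formula as given yields about 2709.5 ms). A sketch:

```python
# The "old disk" figures from the message, in milliseconds.
SEEK, LATENCY, TRANSFER = 9.5, 8.3, 2.6

def sequential_read_ms(pages: int) -> float:
    """One seek, one rotational latency, then `pages` contiguous transfers."""
    return SEEK + LATENCY + pages * TRANSFER

def clustered_random_read_ms(runs: int, pages_per_run: int) -> float:
    """`runs` separate accesses of `pages_per_run` contiguous pages each:
    one initial seek, a latency per run, and a transfer per page."""
    return SEEK + runs * LATENCY + runs * pages_per_run * TRANSFER
```

The point of the model is that counting pages alone misses the latency term, which dominates once reads stop being contiguous.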
- -xun - - -************ - -From owner-pgsql-hackers@hub.org Thu Jan 20 20:41:44 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07020 - for ; Thu, 20 Jan 2000 21:41:43 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id VAA40222; - Thu, 20 Jan 2000 21:34:08 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Thu, 20 Jan 2000 21:32:35 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id VAA38388 - for pgsql-hackers-outgoing; Thu, 20 Jan 2000 21:31:38 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by hub.org (8.9.3/8.9.3) with ESMTP id VAA37422 - for ; Thu, 20 Jan 2000 21:31:02 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA26761; - Thu, 20 Jan 2000 21:30:41 -0500 (EST) -To: "Hiroshi Inoue" -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] Some notes on optimizer cost estimates -In-reply-to: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp> -References: <000b01bf63b1$093cbd40$2801007e@tpf.co.jp> -Comments: In-reply-to "Hiroshi Inoue" - message dated "Fri, 21 Jan 2000 10:44:20 +0900" -Date: Thu, 20 Jan 2000 21:30:41 -0500 -Message-ID: <26758.948421841@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@postgreSQL.org -Status: ORr - -"Hiroshi Inoue" writes: -> I've wondered why we cound't analyze database without vacuum. -> We couldn't run vacuum light-heartedly because it acquires an -> exclusive lock for the target table. - -There is probably no real good reason, except backwards compatibility, -why the ANALYZE function (obtaining pg_statistic data) is part of -VACUUM at all --- it could just as easily be a separate command that -would only use read access on the database. 
Bruce is thinking about -restructuring VACUUM, so maybe now is a good time to think about -splitting out the ANALYZE code too. - -> In addition,vacuum error occurs with analyze option in most -> cases AFAIK. - -Still, with current sources? What's the error message? I fixed -a problem with pg_statistic tuples getting too big... - - regards, tom lane - -************ - -From tgl@sss.pgh.pa.us Thu Jan 20 21:10:28 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08412 - for ; Thu, 20 Jan 2000 22:10:26 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA27080; - Thu, 20 Jan 2000 22:10:28 -0500 (EST) -To: Bruce Momjian -cc: Hiroshi Inoue , pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Some notes on optimizer cost estimates -In-reply-to: <200001210248.VAA07186@candle.pha.pa.us> -References: <200001210248.VAA07186@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Thu, 20 Jan 2000 21:48:57 -0500" -Date: Thu, 20 Jan 2000 22:10:28 -0500 -Message-ID: <27077.948424228@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -Bruce Momjian writes: -> It is nice that ANALYZE is done during vacuum. I can't imagine why you -> would want to do an analyze without adding a vacuum to it. I guess -> that's why I made them the same command. - -Well, the main bad thing about ANALYZE being part of VACUUM is that -it adds to the length of time that VACUUM is holding an exclusive -lock on the table. I think it'd make more sense for it to be a -separate command. - -I have also been thinking about how to make ANALYZE produce a more -reliable estimate of the most common value. 
The three-element list -that it keeps now is a good low-cost hack, but it really doesn't -produce a trustworthy answer unless the MCV is pretty darn C (since -it will never pick up on the MCV at all until there are at least -two occurrences in three adjacent tuples). The only idea I've come -up with is to use a larger list, which would be slower and take -more memory. I think that'd be OK in a separate command, but I -hesitate to do it inside VACUUM --- VACUUM has its own considerable -memory requirements, and there's still the issue of not holding down -an exclusive lock longer than you have to. - - regards, tom lane - -From Inoue@tpf.co.jp Thu Jan 20 21:08:32 2000 -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA08225 - for ; Thu, 20 Jan 2000 22:08:29 -0500 (EST) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id MAA04148; Fri, 21 Jan 2000 12:08:30 +0900 -From: "Hiroshi Inoue" -To: "Bruce Momjian" , "Tom Lane" -Cc: -Subject: RE: [HACKERS] Some notes on optimizer cost estimates -Date: Fri, 21 Jan 2000 12:14:10 +0900 -Message-ID: <001301bf63bd$95cbe680$2801007e@tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -In-Reply-To: <200001210248.VAA07186@candle.pha.pa.us> -Importance: Normal -Status: OR - -> -----Original Message----- -> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -> -> > "Hiroshi Inoue" writes: -> > > I've wondered why we cound't analyze database without vacuum. -> > > We couldn't run vacuum light-heartedly because it acquires an -> > > exclusive lock for the target table. 
-> > -> > There is probably no real good reason, except backwards compatibility, -> > why the ANALYZE function (obtaining pg_statistic data) is part of -> > VACUUM at all --- it could just as easily be a separate command that -> > would only use read access on the database. Bruce is thinking about -> > restructuring VACUUM, so maybe now is a good time to think about -> > splitting out the ANALYZE code too. -> -> I put it in vacuum because at the time I didn't know how to do such -> things and vacuum already scanned the table. I just linked on the the -> scan. Seemed like a good idea at the time. -> -> It is nice that ANALYZE is done during vacuum. I can't imagine why you -> would want to do an analyze without adding a vacuum to it. I guess -> that's why I made them the same command. -> -> If I made them separate commands, both would have to scan the table, -> though the analyze could do it without the exclusive lock, which would -> be good. -> - -The functionality of VACUUM and ANALYZE is quite different. -I don't prefer to charge VACUUM more than now about analyzing -database. Probably looong lock,more aborts .... -Various kind of analysis would be possible by splitting out ANALYZE. - -Regards. 
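
One classic way to track a most-common-value candidate in a single pass with a small fixed list is the Misra-Gries frequent-items sketch. This is offered only as an illustration of the bounded-list idea Tom describes above, not the algorithm VACUUM's three-element list actually used; it shows the same weakness, that a value must be genuinely frequent to survive in so few slots:

```python
def bounded_mcv(values, k=3):
    """One-pass candidate tracking with at most k counters
    (Misra-Gries). Returns the surviving candidate with the
    highest residual count, or None for empty input."""
    counts = {}
    for v in values:
        if v in counts:
            counts[v] += 1
        elif len(counts) < k:
            counts[v] = 1
        else:
            # List is full: decrement every candidate, dropping zeros.
            for c in list(counts):
                counts[c] -= 1
                if counts[c] == 0:
                    del counts[c]
    return max(counts, key=counts.get) if counts else None
```

A larger k makes the candidate more trustworthy at the cost of memory and time, which is exactly the trade-off Tom raises for doing this inside VACUUM versus a separate ANALYZE.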
- -Hiroshi Inoue -Inoue@tpf.co.jp - -From owner-pgsql-hackers@hub.org Fri Jan 21 11:01:59 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA07821 - for ; Fri, 21 Jan 2000 12:01:57 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id LAA77357; - Fri, 21 Jan 2000 11:52:25 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 11:50:46 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id LAA76756 - for pgsql-hackers-outgoing; Fri, 21 Jan 2000 11:49:50 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from eclipse.pacifier.com (eclipse.pacifier.com [199.2.117.78]) - by hub.org (8.9.3/8.9.3) with ESMTP id LAA76594 - for ; Fri, 21 Jan 2000 11:49:01 -0500 (EST) - (envelope-from dhogaza@pacifier.com) -Received: from desktop (dsl-dhogaza.pacifier.net [216.65.147.68]) - by eclipse.pacifier.com (8.9.3/8.9.3pop) with SMTP id IAA00225; - Fri, 21 Jan 2000 08:47:26 -0800 (PST) -Message-Id: <3.0.1.32.20000121081044.01036290@mail.pacifier.com> -X-Sender: dhogaza@mail.pacifier.com -X-Mailer: Windows Eudora Pro Version 3.0.1 (32) -Date: Fri, 21 Jan 2000 08:10:44 -0800 -To: xun@cs.ucsb.edu, pgsql-hackers@postgreSQL.org -From: Don Baccus -Subject: Re: Re. [HACKERS] Some notes on optimizer cost estimates -In-Reply-To: <200001210219.SAA22377@xp10-06.dialup.commserv.ucsb.edu> -Mime-Version: 1.0 -Content-Type: text/plain; charset="us-ascii" -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -At 06:19 PM 1/20/00 -0800, Xun Cheng wrote: ->I'm very glad you bring up this cost estimate issue. ->Recent work in database research have argued a more ->detailed disk access cost model should be used for ->large queries especially joins. ->Traditional cost estimate only considers the number of ->disk pages accessed. However a more detailed model ->would consider three parameters: avg. 
seek, avg. latency ->and avg. page transfer. For old disk, typical values are ->SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms. ->A sequential continuous reading of a table (assuming ->1000 continuous pages) would cost ->(SEEK+LATENCY+1000*TRANFER=2617.8ms); while quasi-randomly ->reading 200 times with 2 continuous pages/time would ->cost (SEEK+200*LATENCY+400*TRANSFER=2700ms). ->Someone from IBM lab re-studied the traditional ->ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost -model ->and found some interesting results. - -One complication when doing an index scan is that you are -accessing two separate files (table and index), which can frequently -be expected to cause an considerable increase in average seek time. - -Oracle and other commercial databases recommend spreading indices and -tables over several spindles if at all possible in order to minimize -this effect. - -I suspect it also helps their optimizer make decisions that are -more consistently good for customers with the largest and most -complex databases and queries, by making cost estimates more predictably -reasonable. - -Still...this doesn't help with the question about the effect of the -filesystem system cache. I wandered around the web for a little bit -last night, and found one summary of a paper by Osterhout on the -effect of the Solaris cache on a fileserver serving diskless workstations. -There was reference to the hierarchy involved (i.e. the local workstation -cache is faster than the fileserver's cache which has to be read via -the network which in turn is faster than reading from the fileserver's -disk). It appears the rule-of-thumb for the cache-hit ratio on reads, -presumably based on measuring some internal Sun systems, used in their -calculations was 80%. - -Just a datapoint to think about. - -There's also considerable operating system theory on paging systems -that might be useful for thinking about trying to estimate the -Postgres cache/hit ratio. 
Then again, maybe Postgres could just -keep count of how many pages of a given table are in the cache at -any given time? Or simply keep track of the current ratio of hits -and misses? - ->>I have been spending some time measuring actual runtimes for various ->>sequential-scan and index-scan query plans, and have learned that the ->>current Postgres optimizer's cost estimation equations are not very ->>close to reality at all. - ->One interesting question I'd like to ask is if this non-closeness ->really affects the optimal choice of postgresql's query optimizer. ->And to what degree the effects might be? My point is that ->if the optimizer estimated the cost for sequential-scan is 10 and ->the cost for index-scan is 20 while the actual costs are 10 vs. 40, ->it should be ok because the optimizer would still choose sequential-scan ->as it should. - -This is crucial, of course - if there are only two types of scans -available, what ever heuristic is used only has to be accurate enough -to pick the right one. Once the choice is made, it doesn't really -matter (from the optimizer's POV) just how long it will actually take, -the time will be spent and presumably it will be shorter than the -alternative. - -How frequently will the optimizer choose wrongly if: - -1. All of the tables and indices were in PG buffer cache or filesystem - cache? (i.e. fixed access times for both types of scans) - -or - -2. The table's so big that only a small fraction can reside in RAM - during the scan and join, which means that the non-sequential - disk access pattern of the indexed scan is much more expensive. - -Also, if you pick sequential scans more frequently based on a presumption -that index scans are expensive due to increased average seek time, how -often will this penalize the heavy-duty user that invests in extra -drives and lots of RAM? - -... 
- ->>The current cost estimation ->>method essentially assumes that the buffer cache plus OS disk cache will ->>be 100% efficient --- we will never have to read the same page of the ->>main table twice in a scan, due to having discarded it between ->>references. This of course is unreasonably optimistic. Worst case ->>is that we'd fetch a main-table page for each selected tuple, but in ->>most cases that'd be unreasonably pessimistic. -> ->This is actually the motivation that I asked before if postgresql ->has a raw disk facility. That way we have much control on this cache ->issue. Of course only if we can provide some algo. better than OS ->cache algo. (depending on the context, like large joins), a raw disk ->facility will be worthwhile (besides the recoverability). - -Postgres does have control over its buffer cache. The one thing that -raw disk I/O would give you is control over where blocks are placed, -meaning you could more accurately model the cost of retrieving them. -So presumably the cache could be tuned to the allocation algorithm -used to place various structures on the disk. - -I still wonder just how much gain you get by this approach. Compared, -to, say simply spending $2,000 on a gigabyte of RAM. Heck, PCs even -support a couple gigs of RAM now. - ->Actually I have another question for you guys which is somehow related ->to this cost estimation issue. You know the difference between OLTP ->and OLAP. My question is how you target postgresql on both kinds ->of applications or just OLTP. From what I know OLTP and OLAP would ->have a big difference in query characteristics and thus ->optimization difference. If postgresql is only targeted on ->OLTP, the above cost estimation issue might not be that ->important. However for OLAP, large tables and large queries are ->common and optimization would be difficult. 
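
Don's suggestion of tracking a hit/miss ratio would let the planner charge an expected per-page cost instead of a flat 1.0. A sketch with invented constants: the 4.0 echoes Tom's measured 4-5x sequential/random ratio, the 0.8 the Sun rule-of-thumb cache-hit figure mentioned above, and the near-zero cached cost his observation that a kernel-cache hit is much cheaper than a disk read:

```python
def page_fetch_cost(hit_ratio: float,
                    random_page_cost: float = 4.0,
                    cached_page_cost: float = 0.01) -> float:
    """Expected cost of one page fetch, in sequential-read units:
    hit_ratio * cached_cost + (1 - hit_ratio) * disk_cost."""
    assert 0.0 <= hit_ratio <= 1.0
    return hit_ratio * cached_page_cost + (1.0 - hit_ratio) * random_page_cost
```

The open problem from the thread remains choosing hit_ratio: it should fall from ~1.0 for tables smaller than the cache toward 0 for huge tables, and the kernel's cache size is unknown to Postgres.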
- - - -- Don Baccus, Portland OR - Nature photos, on-line guides, Pacific Northwest - Rare Bird Alert Service and other goodies at - http://donb.photo.net. - -************ - -From pgsql-hackers-owner+M6019@hub.org Mon Aug 21 11:47:56 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA07289 - for ; Mon, 21 Aug 2000 11:47:55 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7LFlpT03383; - Mon, 21 Aug 2000 11:47:51 -0400 (EDT) -Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7LFlaT03243 - for ; Mon, 21 Aug 2000 11:47:37 -0400 (EDT) -Received: (qmail 7416 invoked by alias); 21 Aug 2000 15:54:33 -0000 -Received: (qmail 7410 invoked from network); 21 Aug 2000 15:54:32 -0000 -Received: from eros.si.fct.unl.pt (193.136.120.112) - by fct1.si.fct.unl.pt with SMTP; 21 Aug 2000 15:54:32 -0000 -Date: Mon, 21 Aug 2000 16:48:08 +0100 (WEST) -From: =?iso-8859-1?Q?Tiago_Ant=E3o?= -X-Sender: tiago@eros.si.fct.unl.pt -To: Tom Lane -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, - constant-->index scan -In-Reply-To: <1731.966868649@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ORr - -On Mon, 21 Aug 2000, Tom Lane wrote: - -> > One thing it might be interesting (please tell me if you think -> > otherwise) would be to improve pg with better statistical information, by -> > using, for example, histograms. -> -> Yes, that's been on the todo list for a while. - - If it's ok and nobody is working on that, I'll look on that subject. - I'll start by looking at the analize portion of vacuum. I'm thinking in -using arrays for the histogram (I've never used the array data type of -postgres). 
- Should I use 7.0.2 or the cvs version? - - -> Interesting article. We do most of what she talks about, but we don't -> have anything like the ClusterRatio statistic. We need it --- that was -> just being discussed a few days ago in another thread. Do you have any -> reference on exactly how DB2 defines that stat? - - - I don't remember seeing that information spefically. From what I've -read I can speculate: - - 1. They have clusterratios for both indexes and the relation itself. - 2. They might use an index even if there is no "order by" if the table -has a low clusterratio: just to get the RIDs, then sort the RIDs and -fetch. - 3. One possible way to calculate this ratio: - a) for tables - SeqScan - if tuple points to a next tuple on the same page then its -"good" - ratio = # good tuples / # all tuples - b) for indexes (high speculation ratio here) - foreach pointed RID in index - if RID is in same page of next RID in index than mark as -"good" - - I suspect that if a tuple size is big (relative to page size) than the -cluster ratio is always low. - - A tuple might also be "good" if it pointed to the next page. 
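
Tiago's speculated computations (a) and (b) can be sketched directly. This simplifies a sequential scan to a list of the heap page number of each tuple in order (real code would walk RIDs), and counts a tuple as "good" when the next tuple lives on the same page:

```python
def table_cluster_ratio(tuple_pages):
    """Sketch (a) for tables: ratio = # good tuples / # all tuples,
    where a tuple is "good" if the next tuple in heap order is on
    the same page. `tuple_pages` is the page of each tuple in order."""
    if not tuple_pages:
        return 1.0
    good = sum(1 for a, b in zip(tuple_pages, tuple_pages[1:]) if a == b)
    return good / len(tuple_pages)

def index_cluster_ratio(heap_pages_in_index_order):
    """Sketch (b) for indexes: the same test, applied to the heap
    pages the index's RIDs point at, taken in index order."""
    return table_cluster_ratio(heap_pages_in_index_order)
```

As Tiago notes, large tuples relative to the page size force this ratio low, and counting a pointer to the immediately following page as "good" would be a reasonable refinement.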
- -Tiago - - -From pgsql-hackers-owner+M6152@hub.org Wed Aug 23 13:00:33 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA10259 - for ; Wed, 23 Aug 2000 13:00:33 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7NGsPN83008; - Wed, 23 Aug 2000 12:54:25 -0400 (EDT) -Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7NGniN81749 - for ; Wed, 23 Aug 2000 12:49:44 -0400 (EDT) -Received: (qmail 9869 invoked by alias); 23 Aug 2000 15:10:04 -0000 -Received: (qmail 9860 invoked from network); 23 Aug 2000 15:10:04 -0000 -Received: from eros.si.fct.unl.pt (193.136.120.112) - by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 15:10:04 -0000 -Date: Wed, 23 Aug 2000 16:03:42 +0100 (WEST) -From: =?iso-8859-1?Q?Tiago_Ant=E3o?= -X-Sender: tiago@eros.si.fct.unl.pt -To: Tom Lane -cc: Jules Bean , pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, - constant-->index scan -In-Reply-To: <27971.967041030@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ORr - -Hi! - -On Wed, 23 Aug 2000, Tom Lane wrote: - -> Yes, we know about that one. We have stats about the most common value -> in a column, but no information about how the less-common values are -> distributed. We definitely need stats about several top values not just -> one, because this phenomenon of a badly skewed distribution is pretty -> common. - - - An end-biased histogram has stats on top values and also on the least -frequent values. So if a there is a selection on a value that is well -bellow average, the selectivity estimation will be more acurate. 
In some
-research papers I've read, it's reported that this is a better approach
-than equi-width histograms (which are said to be the "industry"
-standard).
-
-  I'm not sure whether to use a table or an array attribute on pg_stat for
-the histogram; the problem is what could be expected from the size of the
-attribute (being a text).  I'm very afraid of the cost of going through
-several tuples on a table (pg_histogram?) during the optimization phase.
-
-  One other idea would be to only have better statistics for special
-attributes requested by the user... something like "analyze special
-table(column)".
-
-Best Regards,
-Tiago
-
-
-
-From pgsql-hackers-owner+M6160@hub.org Thu Aug 24 00:21:39 2000
-Received: from hub.org (root@hub.org [216.126.84.1])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA27662
-	for ; Thu, 24 Aug 2000 00:21:38 -0400 (EDT)
-Received: from hub.org (majordom@localhost [127.0.0.1])
-	by hub.org (8.10.1/8.10.1) with SMTP id e7O46w585951;
-	Thu, 24 Aug 2000 00:06:58 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2])
-	by hub.org (8.10.1/8.10.1) with ESMTP id e7O3uv583775
-	for ; Wed, 23 Aug 2000 23:56:57 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
-	by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA20973;
-	Wed, 23 Aug 2000 23:56:35 -0400 (EDT)
-To: =?iso-8859-1?Q?Tiago_Ant=E3o?=
-cc: Jules Bean , pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan
-In-reply-to:
-References:
-Comments: In-reply-to =?iso-8859-1?Q?Tiago_Ant=E3o?=
-	 message dated "Wed, 23 Aug 2000 16:03:42 +0100"
-Date: Wed, 23 Aug 2000 23:56:35 -0400
-Message-ID: <20970.967089395@sss.pgh.pa.us>
-From: Tom Lane
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-
-=?iso-8859-1?Q?Tiago_Ant=E3o?= writes:
-> One other idea would be to only have better statistics for special
-> 
attributes requested by the user... something like "analyze special -> table(column)". - -This might actually fall out "for free" from the cheapest way of -implementing the stats. We've talked before about scanning btree -indexes directly to obtain data values in sorted order, which makes -it very easy to find the most common values. If you do that, you -get good stats for exactly those columns that the user has created -indexes on. A tad indirect but I bet it'd be effective... - - regards, tom lane - -From pgsql-hackers-owner+M6165@hub.org Thu Aug 24 05:33:02 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id FAA14309 - for ; Thu, 24 Aug 2000 05:33:01 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7O9X0584670; - Thu, 24 Aug 2000 05:33:00 -0400 (EDT) -Received: from athena.office.vi.net (office-gwb.fulham.vi.net [194.88.77.158]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7O9Ix581216 - for ; Thu, 24 Aug 2000 05:19:03 -0400 (EDT) -Received: from grommit.office.vi.net [192.168.1.200] (mail) - by athena.office.vi.net with esmtp (Exim 3.12 #1 (Debian)) - id 13Rt2Y-00073I-00; Thu, 24 Aug 2000 10:11:14 +0100 -Received: from jules by grommit.office.vi.net with local (Exim 3.12 #1 (Debian)) - id 13Rt2Y-0005GV-00; Thu, 24 Aug 2000 10:11:14 +0100 -Date: Thu, 24 Aug 2000 10:11:14 +0100 -From: Jules Bean -To: Tom Lane -Cc: Tiago Ant?o , pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Optimisation deficiency: currval('seq')-->seq scan, constant-->index scan -Message-ID: <20000824101113.N17510@grommit.office.vi.net> -References: <1731.966868649@sss.pgh.pa.us> <20000823133418.F17510@grommit.office.vi.net> <27971.967041030@sss.pgh.pa.us> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2i -In-Reply-To: <27971.967041030@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Wed, Aug 23, 2000 at 10:30:30AM 
-0400
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-
-On Wed, Aug 23, 2000 at 10:30:30AM -0400, Tom Lane wrote:
-> Jules Bean writes:
-> > I have in a table a 'category' column which takes a small number of
-> > (basically fixed) values.  Here by 'small', I mean ~1000, while the
-> > table itself has ~10 000 000 rows.  Some categories have many, many
-> > more rows than others.  In particular, there's one category which hits
-> > over half the rows.  Because of this (AIUI) postgresql assumes
-> > that the query
-> > select ... from thistable where category='something'
-> > is best served by a seqscan, even though there is an index on
-> > category.
->
-> Yes, we know about that one.  We have stats about the most common value
-> in a column, but no information about how the less-common values are
-> distributed.  We definitely need stats about several top values not just
-> one, because this phenomenon of a badly skewed distribution is pretty
-> common.
-
-ISTM that that might be enough, in fact.
-
-If you have stats telling you that the most popular value is 'xyz',
-and that it constitutes 50% of the rows (i.e. 5 000 000) then you can
-conclude that, on average, other entries constitute a mere 5 000
-000/999 ~~ 5000 entries, and it would definitely be enough.
-(That's assuming you store the number of distinct values somewhere).
-
-
-> BTW, if your highly-popular value is actually a dummy value ('UNKNOWN'
-> or something like that), a fairly effective workaround is to replace the
-> dummy entries with NULL.  The system does account for NULLs separately
-> from real values, so you'd then get stats based on the most common
-> non-dummy value.
-
-I can't really do that.  Even if I could, the distribution is very
-skewed -- so the next most common makes up a very high proportion of
-what's left.  I forget the figures exactly.
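The arithmetic above generalizes into a simple two-case estimator. A hedged sketch of that reasoning (this is not the planner's actual code; all names are made up): given the table size, the most-common value's fraction, and a distinct-value count, spread the non-MCV rows evenly over the remaining values.

```python
def estimate_matches(total_rows, mcv_fraction, n_distinct, is_mcv):
    """Estimate rows matched by `col = value`, knowing only the
    most-common-value fraction and the number of distinct values."""
    if is_mcv:
        return total_rows * mcv_fraction
    # Spread the remaining rows evenly over the remaining values.
    rest_rows = total_rows * (1.0 - mcv_fraction)
    return rest_rows / (n_distinct - 1)

# The case above: 10M rows, one category covering half, ~1000 categories.
print(int(estimate_matches(10_000_000, 0.5, 1000, is_mcv=True)))   # 5000000
print(int(estimate_matches(10_000_000, 0.5, 1000, is_mcv=False)))  # 5005
```

With estimates this far apart, the MCV case clearly favors a seqscan and the non-MCV case an index scan, which is exactly the distinction being asked for.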
- -Jules - -From pgsql-hackers-owner+M6154@hub.org Wed Aug 23 14:36:41 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA11076 - for ; Wed, 23 Aug 2000 13:36:41 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7NHTqN92431; - Wed, 23 Aug 2000 13:29:52 -0400 (EDT) -Received: from mail.fct.unl.pt (fct1.si.fct.unl.pt [193.136.120.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7NHM1N90883 - for ; Wed, 23 Aug 2000 13:22:01 -0400 (EDT) -Received: (qmail 13816 invoked by alias); 23 Aug 2000 17:29:02 -0000 -Received: (qmail 13807 invoked from network); 23 Aug 2000 17:29:02 -0000 -Received: from eros.si.fct.unl.pt (193.136.120.112) - by fct1.si.fct.unl.pt with SMTP; 23 Aug 2000 17:29:02 -0000 -Date: Wed, 23 Aug 2000 18:22:40 +0100 (WEST) -From: =?iso-8859-1?Q?Tiago_Ant=E3o?= -X-Sender: tiago@eros.si.fct.unl.pt -To: Tom Lane -cc: =?iso-8859-1?Q?Tiago_Ant=E3o?= , - PostgreSQL Hackers list -Subject: Re: [HACKERS] analyze.c -In-Reply-To: <28154.967041988@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ORr - - - -On Wed, 23 Aug 2000, Tom Lane wrote: - -> > What's the big reason not to do that? I know that -> > there is some code in analyze.c (like comparing) that uses other parts of -> > pg, but that seems to be easily fixed. -> -> Are you proposing not to do any comparisons? It will be interesting to -> see how you can compute a histogram without any idea of equality or -> ordering. But if you want that, then you still need the function-call -> manager as well as the type-specific comparison routines for every -> datatype that you might be asked to operate on (don't forget -> user-defined types here). 
-
-  I forgot user-defined data types :-(, but regarding histograms I think
-the code can be made external (at least for testing purposes):
-  1. I was not suggesting not to do any comparisons, but I think the only
-comparison I need is equality; I don't need order, as I don't need to
-calculate mins or maxes (I just need mins and maxes on frequencies, NOT on
-the data itself) to make a histogram.
-  2. The mapping to text guarantees that I have (PQgetvalue always returns
-char* and pg_statistics keeps a "text" anyway) a way of knowing
-about equality regardless of type.
-
-  But at least anything relating to order has to be in.
-
-> > I'm leaning toward the implementation of end-biased histograms.  There is
-> > an introductory reference in the IEEE Data Engineering Bulletin, September
-> > 1995 (available on the Microsoft Research site).
->
-> Sounds interesting.  Can you give us an exact URL?
-
-http://www.research.microsoft.com/research/db/debull/default.htm
-
-BTW, you can get access to SIGMOD CDs with lots of goodies for a very low
-price (at least in 1999 it was a bargain); check out ACM membership for
-SIGMOD.
-
-I've been reading something about implementation of histograms, and,
-AFAIK, in practice a histogram is just a cool name for no more than:
-  1. the top ten values, with a frequency for each
-  2. the same for the top ten least frequent
-  3. an average for the rest
-
-I'm writing code to get this info (outside pg for now - for testing
-purposes).
-
-Best Regards,
-Tiago
-PS - again: I'm starting, so some of my comments can be completely dumb.
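The three-part summary described above (top frequencies, the same for the least frequent, an average for the rest) takes only a few lines to compute. A sketch under the assumption that one full pass over the column's values is affordable; all names here are hypothetical, not anything in pg_statistics:

```python
from collections import Counter

def end_biased_histogram(values, k=10):
    """End-biased histogram: exact frequencies for the k most and k
    least common values, one average frequency for everything else."""
    ordered = Counter(values).most_common()   # sorted by descending count
    top = ordered[:k]
    bottom = ordered[-k:] if len(ordered) > 2 * k else []
    middle = ordered[k:len(ordered) - k] if bottom else ordered[k:]
    avg_rest = (sum(c for _, c in middle) / len(middle)) if middle else 0.0
    return {"top": top, "bottom": bottom, "rest_avg": avg_rest}

hist = end_biased_histogram(list("aaaaabbbcc") + list("defg"), k=2)
print(hist["top"])     # [('a', 5), ('b', 3)]
print(hist["rest_avg"])
```

The selectivity lookup then checks the two exact lists first and falls back to the average only for values in the middle, which is where end-biased histograms beat a single most-common-value statistic.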
- -From pgsql-hackers-owner+M7514@hub.org Sun Oct 15 20:38:12 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA18459 - for ; Sun, 15 Oct 2000 19:38:12 -0400 (EDT) -Received: from hub.org.org (localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e9FNaUR59496; - Sun, 15 Oct 2000 19:36:30 -0400 (EDT) -Received: from sss.pgh.pa.us (sss.pgh.pa.us [209.114.132.154]) - by hub.org (8.10.1/8.10.1) with ESMTP id e9FNYuR58276 - for ; Sun, 15 Oct 2000 19:34:56 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.0/8.11.0) with ESMTP id e9FNXaB06046; - Sun, 15 Oct 2000 19:33:36 -0400 (EDT) -To: Bruce Momjian -cc: Jules Bean , - Alfred Perlstein , pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Performance on inserts -In-reply-to: <200010152320.TAA17944@candle.pha.pa.us> -References: <200010152320.TAA17944@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Sun, 15 Oct 2000 19:20:35 -0400" -Date: Sun, 15 Oct 2000 19:33:36 -0400 -Message-ID: <6043.971652816@sss.pgh.pa.us> -From: Tom Lane -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ORr - -Bruce Momjian writes: -> However, assume tab2.col2 equals 3. I assume this would cause an index -> scan because the executor doesn't know about the most common value, -> right? Is it worth trying to improve that? - -Oh, I see: you are assuming that a nestloop join is being done, and -wondering if it's worthwhile to switch dynamically between seqscan -and indexscan for each scan of the inner relation, depending on exactly -what value is being supplied from the outer relation for that scan. -Hmm. - -Not sure if it's worth the trouble or not. Nestloop is usually a -last-resort join strategy anyway, and is unlikely to be picked when the -tables are large enough to make performance be a big issue. 
- - regards, tom lane - -From tgl@sss.pgh.pa.us Mon Oct 16 01:48:27 2000 -Received: from sss.pgh.pa.us (sss.pgh.pa.us [209.114.132.154]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA01602 - for ; Mon, 16 Oct 2000 00:48:26 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.1/8.11.1) with ESMTP id e9G4mu521809; - Mon, 16 Oct 2000 00:48:56 -0400 (EDT) -To: Bruce Momjian -cc: Jules Bean , - Alfred Perlstein , pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Performance on inserts -In-reply-to: <200010160441.AAA01374@candle.pha.pa.us> -References: <200010160441.AAA01374@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Mon, 16 Oct 2000 00:41:49 -0400" -Date: Mon, 16 Oct 2000 00:48:56 -0400 -Message-ID: <21806.971671736@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -Bruce Momjian writes: ->> So an inner indexscan for tab1 is definitely a possible plan. - -> Yes, that was my point, that a nested loop could easily be involved if -> the joined table has a restriction. Is there a TODO item here? - -More like a "to investigate" --- I'm not sold on the idea that a -dynamic switch in plan types would be a win. Maybe it would be, -but... - -One thing to think about is that it'd be critically dependent on having -accurate statistics. Currently, the planner only places bets on the -average behavior over a whole join. If you make a separate bet on each -scan, then you open up the risk of betting wrong every time, should -your stats be out-of-date or otherwise misleading. 
- - regards, tom lane - -From pgsql-hackers-owner+M29943@postgresql.org Thu Oct 3 18:18:27 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MIOU23771 - for ; Thu, 3 Oct 2002 18:18:25 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id B9F51476570; Thu, 3 Oct 2002 18:18:21 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id E083B4761B0; Thu, 3 Oct 2002 18:18:19 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 13ADC476063 - for ; Thu, 3 Oct 2002 18:18:17 -0400 (EDT) -Received: from acorn.he.net (acorn.he.net [64.71.137.130]) - by postgresql.org (Postfix) with ESMTP id 3AEC8475FFF - for ; Thu, 3 Oct 2002 18:18:16 -0400 (EDT) -Received: from CurtisVaio ([63.164.0.47] (may be forged)) by acorn.he.net (8.8.6/8.8.2) with SMTP id PAA19215; Thu, 3 Oct 2002 15:18:14 -0700 -From: "Curtis Faith" -To: "Tom Lane" -cc: "Pgsql-Hackers" -Subject: Re: [HACKERS] Advice: Where could I be of help? -Date: Thu, 3 Oct 2002 18:17:55 -0400 -Message-ID: -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) -In-Reply-To: <13379.1033675158@sss.pgh.pa.us> -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 -Importance: Normal -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Status: OR - -tom lane wrote: -> But more globally, I think that our worst problems these days have to do -> with planner misestimations leading to bad plans. The planner is -> usually *capable* of generating a good plan, but all too often it picks -> the wrong one. 
We need work on improving the cost modeling equations -> to be closer to reality. If that's at all close to your sphere of -> interest then I think it should be #1 priority --- it's both localized, -> which I think is important for a first project, and potentially a -> considerable win. - -This seems like a very interesting problem. One of the ways that I thought -would be interesting and would solve the problem of trying to figure out the -right numbers is to have certain guesses for the actual values based on -statistics gathered during vacuum and general running and then have the -planner run the "best" plan. - -Then during execution if the planner turned out to be VERY wrong about -certain assumptions the execution system could update the stats that led to -those wrong assumptions. That way the system would seek the correct values -automatically. We could also gather the stats that the system produces for -certain actual databases and then use those to make smarter initial guesses. - -I've found that I can never predict costs. I always end up testing -empirically and find myself surprised at the results. - -We should be able to make the executor smart enough to keep count of actual -costs (or a statistical approximation) without introducing any significant -overhead. - -tom lane also wrote: -> There is no "cache flushing". We have a shared buffer cache management -> algorithm that's straight LRU across all buffers. There's been some -> interest in smarter cache-replacement code; I believe Neil Conway is -> messing around with an LRU-2 implementation right now. If you've got -> better ideas we're all ears. - -Hmmm, this is the area that I think could lead to huge performance gains. - -Consider a simple system with a table tbl_master that gets read by each -process many times but with very infrequent inserts and that contains about -3,000 rows. 
The single but heavily used index for this table is contained in
-a btree with a depth of three with 20 - 8K pages in the first two levels of
-the btree.
-
-Another table tbl_detail with 10 indices that gets very frequent inserts.
-There are over 300,000 rows.  Some queries result in index scans over the
-approximately 5,000 8K pages in the index.
-
-There is a 40M shared cache for this system.
-
-Every time a query which requires the index scan runs, it will blow out the
-entire cache since the scan will load more blocks than the cache holds.  Only
-blocks that are accessed while the scan is going will survive.  LRU is bad,
-bad, bad!
-
-LRU-2 might be better but it seems like it still won't give enough priority
-to the most frequently used blocks.  I don't see how it would do better for
-the above case.
-
-I once implemented a modified cache algorithm that was based on the clock
-algorithm for VM page caches.  VM paging is similar to databases in that
-there is definite locality of reference and certain pages are MUCH more
-likely to be requested.
-
-The basic idea was to have a flag in each block that represented the access
-time in clock intervals.  Imagine a clock hand sweeping across a clock; every
-access is like a tiny movement in the clock hand.  Blocks that are not
-accessed during a sweep are candidates for removal.
-
-My modification was to use access counts to increase the durability of the
-more accessed blocks.  Each time a block is accessed its flag is shifted
-left (up to a maximum number of shifts - ShiftN) and 1 is added to it.
-Every so many cache accesses (and synchronously when the cache is full) a
-pass is made over each block, right shifting the flags (a clock sweep).  This
-can also be done one block at a time each access so the clock is directly
-linked to the cache access rate.  Any blocks with 0 are placed into a doubly
-linked list of candidates for removal.  New cache blocks are allocated from
-the list of candidates.
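The flag-shifting scheme just described (the candidate-list handling continues below) is concrete enough to simulate. A toy model, not the original C implementation: sweeps here run only when the cache is full, and all names are invented for illustration.

```python
SHIFT_MAX = 4  # maximum number of shifts; "ShiftN" in the description

class ClockCache:
    """Toy model: each access shifts a block's flag left and adds 1;
    a sweep shifts every flag right; blocks whose flag decays to 0
    become eviction candidates."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.flags = {}  # block id -> access flag

    def access(self, block):
        if block not in self.flags and len(self.flags) >= self.capacity:
            self._evict()
        f = self.flags.get(block, 0)
        self.flags[block] = min(f << 1, (1 << SHIFT_MAX) - 1) | 1

    def sweep(self):
        for b in self.flags:
            self.flags[b] >>= 1
        return [b for b, f in self.flags.items() if f == 0]

    def _evict(self):
        candidates = self.sweep()
        while not candidates:          # sweep until something decays to 0
            candidates = self.sweep()
        for b in candidates:
            del self.flags[b]

cache = ClockCache(capacity=3)
for b in ["root", "root", "root", "leaf1"]:
    cache.access(b)
# A scan of once-touched blocks: "root" survives while scan blocks churn.
for b in ["s1", "s2", "s3", "s4"]:
    cache.access(b)
print("root" in cache.flags)  # True
```

Even in this tiny run, the heavily accessed "root" block outlives the scan, which is the claimed advantage over plain LRU.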
Accesses of blocks in the candidate list just
-remove them from the list.
-
-An index root node page would likely be accessed frequently enough so that
-all its bits would be set, so it would take ShiftN clock sweeps.
-
-This algorithm increased the cache hit ratio from 40% to about 90% for the
-cases I tested when compared to a simple LRU mechanism.  The paging ratio is
-greatly dependent on the ratio of the actual database size to the cache
-size.
-
-The bottom line is that it is very important to keep blocks that are frequently
-accessed in the cache.  The top levels of large btrees are accessed many
-hundreds (actually a power of the number of keys in each page) of times more
-frequently than the leaf pages.  LRU can be the worst possible algorithm for
-something like an index or table scan of large tables since it flushes a
-large number of potentially frequently accessed blocks in favor of ones that
-are very unlikely to be retrieved again.
-
-tom lane also wrote:
-> This is an interesting area.  Keep in mind though that Postgres is a
-> portable DB that tries to be agnostic about what kernel and filesystem
-> it's sitting on top of --- and in any case it does not run as root, so
-> has very limited ability to affect what the kernel/filesystem do.
-> I'm not sure how much can be done without losing those portability
-> advantages.
-
-The kinds of things I was thinking about should be very portable.  I found
-that simply writing the cache in order of file system offset results in
-very greatly improved performance since it lets the head seek in smaller
-increments and much more smoothly, especially with modern disks.  Most of the
-time the file system will create files as large sequential runs of bytes on
-the physical disk, in order.  It might be in a few chunks but those chunks will
-be sequential and fairly large.
-
-tom lane also wrote:
-> Well, not really all that isolated.
The bottom-level index code doesn't -> know whether you're doing INSERT or UPDATE, and would have no easy -> access to the original tuple if it did know. The original theory about -> this was that the planner could detect the situation where the index(es) -> don't overlap the set of columns being changed by the UPDATE, which -> would be nice since there'd be zero runtime overhead. Unfortunately -> that breaks down if any BEFORE UPDATE triggers are fired that modify the -> tuple being stored. So all in all it turns out to be a tad messy to fit -> this in :-(. I am unconvinced that the impact would be huge anyway, -> especially as of 7.3 which has a shortcut path for dead index entries. - -Well, this probably is not the right place to start then. - -- Curtis - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M29945@postgresql.org Thu Oct 3 18:47:34 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g93MlWU26068 - for ; Thu, 3 Oct 2002 18:47:32 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id F2AAE476306; Thu, 3 Oct 2002 18:47:27 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id E7B5247604F; Thu, 3 Oct 2002 18:47:24 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 9ADCC4761A1 - for ; Thu, 3 Oct 2002 18:47:18 -0400 (EDT) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id DDB0B476187 - for ; Thu, 3 Oct 2002 18:47:17 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.12.5/8.12.5) with ESMTP id g93MlIhR015091; - Thu, 3 Oct 2002 18:47:18 -0400 (EDT) -To: "Curtis Faith" -cc: "Pgsql-Hackers" -Subject: Re: 
[HACKERS] Advice: Where could I be of help? -In-Reply-To: -References: -Comments: In-reply-to "Curtis Faith" - message dated "Thu, 03 Oct 2002 18:17:55 -0400" -Date: Thu, 03 Oct 2002 18:47:17 -0400 -Message-ID: <15090.1033685237@sss.pgh.pa.us> -From: Tom Lane -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Status: OR - -"Curtis Faith" writes: -> Then during execution if the planner turned out to be VERY wrong about -> certain assumptions the execution system could update the stats that led to -> those wrong assumptions. That way the system would seek the correct values -> automatically. - -That has been suggested before, but I'm unsure how to make it work. -There are a lot of parameters involved in any planning decision and it's -not obvious which ones to tweak, or in which direction, if the plan -turns out to be bad. But if you can come up with some ideas, go to -it! - -> Everytime a query which requires the index scan runs it will blow out the -> entire cache since the scan will load more blocks than the cache -> holds. - -Right, that's the scenario that kills simple LRU ... - -> LRU-2 might be better but it seems like it still won't give enough priority -> to the most frequently used blocks. - -Blocks touched more than once per query (like the upper-level index -blocks) will survive under LRU-2. Blocks touched once per query won't. -Seems to me that it should be a win. - -> My modification was to use access counts to increase the durability of the -> more accessed blocks. - -You could do it that way too, but I'm unsure whether the extra -complexity will buy anything. Ultimately, I think an LRU-anything -algorithm is equivalent to a clock sweep for those pages that only get -touched once per some-long-interval: the single-touch guys get recycled -in order of last use, which seems just like a clock sweep around the -cache. 
The guys with some amount of preference get excluded from the -once-around sweep. To determine whether LRU-2 is better or worse than -some other preference algorithm requires a finer grain of analysis than -this. I'm not a fan of "more complex must be better", so I'd want to see -why it's better before buying into it ... - -> The kinds of things I was thinking about should be very portable. I found -> that simply writing the cache in order of the file system offset results in -> very greatly improved performance since it lets the head seek in smaller -> increments and much more smoothly, especially with modern disks. - -Shouldn't the OS be responsible for scheduling those writes -appropriately? Ye good olde elevator algorithm ought to handle this; -and it's at least one layer closer to the actual disk layout than we -are, thus more likely to issue the writes in a good order. It's worth -experimenting with, perhaps, but I'm pretty dubious about it. - -BTW, one other thing that Vadim kept saying we should do is alter the -cache management strategy to retain dirty blocks in memory (ie, give -some amount of preference to as-yet-unwritten dirty pages compared to -clean pages). There is no reliability cost here since the WAL will let -us reconstruct any dirty pages if we crash before they get written; and -the periodic checkpoints will ensure that we eventually write a dirty -block and thus it will become available for recycling. This seems like -a promising line of thought that's orthogonal to the basic -LRU-vs-whatever issue. Nobody's got round to looking at it yet though. -I've got no idea how much preference should be given to a dirty block ---- not infinite, probably, but some. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M29974@postgresql.org Fri Oct 4 01:28:54 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g945SpU13476 - for ; Fri, 4 Oct 2002 01:28:52 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id 63999476BB2; Fri, 4 Oct 2002 01:26:56 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id BB7CA476B85; Fri, 4 Oct 2002 01:26:54 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 5FD7E476759 - for ; Fri, 4 Oct 2002 01:26:52 -0400 (EDT) -Received: from mclean.mail.mindspring.net (mclean.mail.mindspring.net [207.69.200.57]) - by postgresql.org (Postfix) with ESMTP id 1F4A14766D8 - for ; Fri, 4 Oct 2002 01:26:51 -0400 (EDT) -Received: from 1cust163.tnt1.st-thomas.vi.da.uu.net ([200.58.4.163] helo=CurtisVaio) - by mclean.mail.mindspring.net with smtp (Exim 3.33 #1) - id 17xKzB-0000yK-00; Fri, 04 Oct 2002 01:26:49 -0400 -From: "Curtis Faith" -To: "Tom Lane" -cc: "Pgsql-Hackers" -Subject: Re: [HACKERS] Advice: Where could I be of help? -Date: Fri, 4 Oct 2002 01:26:36 -0400 -Message-ID: -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) -In-Reply-To: <15090.1033685237@sss.pgh.pa.us> -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700 -Importance: Normal -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Status: OR - -I wrote: - -> > My modification was to use access counts to increase the -> durability of the -> > more accessed blocks. 
->
-
-tom lane replies:
-> You could do it that way too, but I'm unsure whether the extra
-> complexity will buy anything.  Ultimately, I think an LRU-anything
-> algorithm is equivalent to a clock sweep for those pages that only get
-> touched once per some-long-interval: the single-touch guys get recycled
-> in order of last use, which seems just like a clock sweep around the
-> cache.  The guys with some amount of preference get excluded from the
-> once-around sweep.  To determine whether LRU-2 is better or worse than
-> some other preference algorithm requires a finer grain of analysis than
-> this.  I'm not a fan of "more complex must be better", so I'd want to see
-> why it's better before buying into it ...
-
-I'm definitely not a fan of "more complex must be better" either.  In fact,
-it's surprising how often the real performance problems are easy to fix
-and simple, while many person-years are spent solving the issue everyone
-"knows" must be causing the performance problems, only to find little gain.
-
-The key here is empirical testing.  If the cache hit ratio for LRU-2 is
-much better then there may be no need here.  OTOH, it took less than
-30 lines or so of code to do what I described, so I don't consider
-it too, too "more complex" :=}  We should run a test which includes
-running indexes (or is indices the PostgreSQL convention?) that are three
-or more times the size of the cache to see how well LRU-2 works.  Is there
-any cache performance reporting built into pgsql?
-
-tom lane wrote:
-> Shouldn't the OS be responsible for scheduling those writes
-> appropriately?  Ye good olde elevator algorithm ought to handle this;
-> and it's at least one layer closer to the actual disk layout than we
-> are, thus more likely to issue the writes in a good order.  It's worth
-> experimenting with, perhaps, but I'm pretty dubious about it.
-
-I wasn't proposing anything other than changing the order of the writes,
-not actually ensuring that they get written that way at the level you
-describe above.  This will help a lot on brain-dead file systems that
-can't do this ordering and probably also in cases where the number
-of blocks in the cache is very large.
-
-On a related note, while looking at the code, it seems to me that we
-are writing out the buffer cache synchronously, so there won't be
-any possibility of the file system reordering anyway.  This appears to be
-a huge performance problem.  I've read claims in the archives that
-the buffers are written asynchronously, but my read of the
-code says otherwise.  Can someone point out my error?
-
-I only see calls that ultimately call FileWrite or write(2), which will
-block without an O_NONBLOCK open.  I thought one of the main reasons
-for having a WAL is so that you can write out the buffers asynchronously.
-
-What am I missing?
-
-I wrote:
-> > Then during execution if the planner turned out to be VERY wrong about
-> > certain assumptions the execution system could update the stats
-> that led to
-> > those wrong assumptions.  That way the system would seek the
-> correct values
-> > automatically.
-
-tom lane replied:
-> That has been suggested before, but I'm unsure how to make it work.
-> There are a lot of parameters involved in any planning decision and it's
-> not obvious which ones to tweak, or in which direction, if the plan
-> turns out to be bad.  But if you can come up with some ideas, go to
-> it!
-
-I'll have to look at the current planner before I can suggest
- -- Curtis - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - diff --git a/doc/TODO.detail/persistent b/doc/TODO.detail/persistent deleted file mode 100644 index f4edaad5ac..0000000000 --- a/doc/TODO.detail/persistent +++ /dev/null @@ -1,102 +0,0 @@ -From owner-pgsql-hackers@hub.org Mon May 11 11:31:09 1998 -Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) - by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03006 - for ; Mon, 11 May 1998 11:31:07 -0400 (EDT) -Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.17 $) with ESMTP id LAA01663 for ; Mon, 11 May 1998 11:24:42 -0400 (EDT) -Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA21841; Mon, 11 May 1998 11:15:25 -0400 (EDT) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:15:12 +0000 (EDT) -Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA21683 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:15:09 -0400 (EDT) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by hub.org (8.8.8/8.7.5) with ESMTP id LAA21451 for ; Mon, 11 May 1998 11:15:03 -0400 (EDT) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA24915; - Mon, 11 May 1998 11:14:43 -0400 (EDT) -To: Brett McCormick -cc: hackers@postgreSQL.org -Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] -In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT) - <13655.4384.345723.466046@abraxas.scene.com> -Date: Mon, 11 May 1998 11:14:43 -0400 -Message-ID: <24913.894899683@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@hub.org -Precedence: bulk -Status: RO - -Brett McCormick writes: -> same way that the current network socket is passed -- through an execv -> argument. 
hopefully, however, the non-execv()ing fork will be in 6.4. - -Um, you missed the point, Brett. David was hoping to transfer a client -connection from the postmaster to an *already existing* backend process. -Fork, with or without exec, solves the problem for a backend that's -started after the postmaster has accepted the client socket. - -This does lead to a different line of thought, however. Pre-started -backends would have access to the "master" connection socket on which -the postmaster listens for client connections, right? Suppose that we -fire the postmaster as postmaster, and demote it to being simply a -manufacturer of new backend processes as old ones get used up. Have -one of the idle backend processes be the one doing the accept() on the -master socket. Once it has a client connection, it performs the -authentication handshake and then starts serving the client (or just -quits if authentication fails). Meanwhile the next idle backend process -has executed accept() on the master socket and is waiting for the next -client; and shortly the postmaster/factory/whateverwecallitnow notices -that it needs to start another backend to add to the idle-backend pool. - -This'd probably need some interlocking among the backends. I have no -idea whether it'd be safe to have all the idle backends trying to -do accept() on the master socket simultaneously, but it sounds risky. -Better to use a mutex so that only one gets to do it while the others -sleep. 
- - regards, tom lane - - -From owner-pgsql-hackers@hub.org Mon May 11 11:35:55 1998 -Received: from hub.org (hub.org [209.47.148.200]) - by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03043 - for ; Mon, 11 May 1998 11:35:53 -0400 (EDT) -Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA23494; Mon, 11 May 1998 11:27:10 -0400 (EDT) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:27:02 +0000 (EDT) -Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA23473 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:27:01 -0400 (EDT) -Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by hub.org (8.8.8/8.7.5) with ESMTP id LAA23462 for ; Mon, 11 May 1998 11:26:56 -0400 (EDT) -Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) - by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA25006; - Mon, 11 May 1998 11:26:44 -0400 (EDT) -To: Brett McCormick -cc: hackers@postgreSQL.org -Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] -In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT) - <13655.4384.345723.466046@abraxas.scene.com> -Date: Mon, 11 May 1998 11:26:44 -0400 -Message-ID: <25004.894900404@sss.pgh.pa.us> -From: Tom Lane -Sender: owner-pgsql-hackers@hub.org -Precedence: bulk -Status: RO - -Meanwhile, *I* missed the point about Brett's second comment :-( - -Brett McCormick writes: -> There will have to be some sort of arg parsing in any case, -> considering that you can pass configurable arguments to the backend.. - -If we do the sort of change David and I were just discussing, then the -pre-spawned backend would become responsible for parsing and dealing -with the PGOPTIONS portion of the client's connection request message. -That's just part of shifting the authentication handshake code from -postmaster to backend, so it shouldn't be too hard. 
- -BUT: the whole point is to be able to initialize the backend before it -is connected to a client. How much of the expensive backend startup -work depends on having the client connection options available? -Any work that needs to know the options will have to wait until after -the client connects. If that means most of the startup work can't -happen in advance anyway, then we're out of luck; a pre-started backend -won't save enough time to be worth the effort. (Unless we are willing -to eliminate or redefine the troublesome options...) - - regards, tom lane - - diff --git a/doc/TODO.detail/pool b/doc/TODO.detail/pool index 0a809e148f..e154f5638a 100644 --- a/doc/TODO.detail/pool +++ b/doc/TODO.detail/pool @@ -1319,3 +1319,105 @@ DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267 ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org +From owner-pgsql-hackers@hub.org Mon May 11 11:31:09 1998 +Received: from renoir.op.net (root@renoir.op.net [209.152.193.4]) + by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03006 + for ; Mon, 11 May 1998 11:31:07 -0400 (EDT) +Received: from hub.org (hub.org [209.47.148.200]) by renoir.op.net (o1/$ Revision: 1.17 $) with ESMTP id LAA01663 for ; Mon, 11 May 1998 11:24:42 -0400 (EDT) +Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA21841; Mon, 11 May 1998 11:15:25 -0400 (EDT) +Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:15:12 +0000 (EDT) +Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA21683 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:15:09 -0400 (EDT) +Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by hub.org (8.8.8/8.7.5) with ESMTP id LAA21451 for ; Mon, 11 May 1998 11:15:03 -0400 (EDT) +Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) + by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA24915; + 
Mon, 11 May 1998 11:14:43 -0400 (EDT) +To: Brett McCormick +cc: hackers@postgreSQL.org +Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] +In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT) + <13655.4384.345723.466046@abraxas.scene.com> +Date: Mon, 11 May 1998 11:14:43 -0400 +Message-ID: <24913.894899683@sss.pgh.pa.us> +From: Tom Lane +Sender: owner-pgsql-hackers@hub.org +Precedence: bulk +Status: RO + +Brett McCormick writes: +> same way that the current network socket is passed -- through an execv +> argument. hopefully, however, the non-execv()ing fork will be in 6.4. + +Um, you missed the point, Brett. David was hoping to transfer a client +connection from the postmaster to an *already existing* backend process. +Fork, with or without exec, solves the problem for a backend that's +started after the postmaster has accepted the client socket. + +This does lead to a different line of thought, however. Pre-started +backends would have access to the "master" connection socket on which +the postmaster listens for client connections, right? Suppose that we +fire the postmaster as postmaster, and demote it to being simply a +manufacturer of new backend processes as old ones get used up. Have +one of the idle backend processes be the one doing the accept() on the +master socket. Once it has a client connection, it performs the +authentication handshake and then starts serving the client (or just +quits if authentication fails). Meanwhile the next idle backend process +has executed accept() on the master socket and is waiting for the next +client; and shortly the postmaster/factory/whateverwecallitnow notices +that it needs to start another backend to add to the idle-backend pool. + +This'd probably need some interlocking among the backends. I have no +idea whether it'd be safe to have all the idle backends trying to +do accept() on the master socket simultaneously, but it sounds risky. 
+Better to use a mutex so that only one gets to do it while the others +sleep. + + regards, tom lane + + +From owner-pgsql-hackers@hub.org Mon May 11 11:35:55 1998 +Received: from hub.org (hub.org [209.47.148.200]) + by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id LAA03043 + for ; Mon, 11 May 1998 11:35:53 -0400 (EDT) +Received: from localhost (majordom@localhost) by hub.org (8.8.8/8.7.5) with SMTP id LAA23494; Mon, 11 May 1998 11:27:10 -0400 (EDT) +Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Mon, 11 May 1998 11:27:02 +0000 (EDT) +Received: (from majordom@localhost) by hub.org (8.8.8/8.7.5) id LAA23473 for pgsql-hackers-outgoing; Mon, 11 May 1998 11:27:01 -0400 (EDT) +Received: from sss.sss.pgh.pa.us (sss.pgh.pa.us [206.210.65.6]) by hub.org (8.8.8/8.7.5) with ESMTP id LAA23462 for ; Mon, 11 May 1998 11:26:56 -0400 (EDT) +Received: from sss.sss.pgh.pa.us (localhost [127.0.0.1]) + by sss.sss.pgh.pa.us (8.8.5/8.8.5) with ESMTP id LAA25006; + Mon, 11 May 1998 11:26:44 -0400 (EDT) +To: Brett McCormick +cc: hackers@postgreSQL.org +Subject: Re: [HACKERS] Re: [PATCHES] Try again: S_LOCK reduced contentionh] +In-reply-to: Your message of Mon, 11 May 1998 07:57:23 -0700 (PDT) + <13655.4384.345723.466046@abraxas.scene.com> +Date: Mon, 11 May 1998 11:26:44 -0400 +Message-ID: <25004.894900404@sss.pgh.pa.us> +From: Tom Lane +Sender: owner-pgsql-hackers@hub.org +Precedence: bulk +Status: RO + +Meanwhile, *I* missed the point about Brett's second comment :-( + +Brett McCormick writes: +> There will have to be some sort of arg parsing in any case, +> considering that you can pass configurable arguments to the backend.. + +If we do the sort of change David and I were just discussing, then the +pre-spawned backend would become responsible for parsing and dealing +with the PGOPTIONS portion of the client's connection request message. +That's just part of shifting the authentication handshake code from +postmaster to backend, so it shouldn't be too hard. 
+ +BUT: the whole point is to be able to initialize the backend before it +is connected to a client. How much of the expensive backend startup +work depends on having the client connection options available? +Any work that needs to know the options will have to wait until after +the client connects. If that means most of the startup work can't +happen in advance anyway, then we're out of luck; a pre-started backend +won't save enough time to be worth the effort. (Unless we are willing +to eliminate or redefine the troublesome options...) + + regards, tom lane + + diff --git a/doc/TODO.detail/prepare b/doc/TODO.detail/prepare deleted file mode 100644 index 5d66512690..0000000000 --- a/doc/TODO.detail/prepare +++ /dev/null @@ -1,1697 +0,0 @@ -From owner-pgsql-hackers@hub.org Wed Nov 18 14:40:49 1998 -Received: from hub.org (majordom@hub.org [209.47.148.200]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA29743 - for ; Wed, 18 Nov 1998 14:40:36 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.1/8.9.1) with SMTP id OAA03716; - Wed, 18 Nov 1998 14:37:04 -0500 (EST) - (envelope-from owner-pgsql-hackers@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Wed, 18 Nov 1998 14:34:39 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.1/8.9.1) id OAA03395 - for pgsql-hackers-outgoing; Wed, 18 Nov 1998 14:34:37 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from orion.SAPserv.Hamburg.dsh.de (Tpolaris2.sapham.debis.de [53.2.131.8]) - by hub.org (8.9.1/8.9.1) with SMTP id OAA03381 - for ; Wed, 18 Nov 1998 14:34:31 -0500 (EST) - (envelope-from wieck@sapserv.debis.de) -Received: by orion.SAPserv.Hamburg.dsh.de - for pgsql-hackers@hub.org - id m0zgDnj-000EBTC; Wed, 18 Nov 98 21:02 MET -Message-Id: -From: jwieck@debis.com (Jan Wieck) -Subject: Re: [HACKERS] PREPARE -To: meskes@usa.net (Michael Meskes) -Date: Wed, 18 Nov 1998 21:02:06 +0100 (MET) -Cc: pgsql-hackers@hub.org 
-Reply-To: jwieck@debis.com (Jan Wieck) -In-Reply-To: <19981118084843.B869@usa.net> from "Michael Meskes" at Nov 18, 98 08:48:43 am -X-Mailer: ELM [version 2.4 PL25] -Content-Type: text -Sender: owner-pgsql-hackers@postgreSQL.org -Precedence: bulk -Status: RO - -Michael Meskes wrote: - -> -> On Wed, Nov 18, 1998 at 03:23:30AM +0000, Thomas G. Lockhart wrote: -> > > I didn't get this one completly. What input do you mean? -> > -> > Just the original string/query to be prepared... -> -> I see. But wouldn't it be more useful to preprocess the query and store the -> resulting nodes instead? We don't want to parse the statement everytime a -> variable binding comes in. - - Right. A real improvement would only be to have the prepared - execution plan in the backend and just giving the parameter - values. - - I can think of the following construct: - - PREPARE optimizable-statement; - - That one will run parser/rewrite/planner, create a new memory - context with a unique identifier and saves the querytree's - and plan's in it. Parameter values are identified by the - usual $n notation. The command returns the identifier. - - EXECUTE QUERY identifier [value [, ...]]; - - then get's back the prepared plan and querytree by the id, - creates an executor context with the given values in the - parameter array and calls ExecutorRun() for them. - - The PREPARE needs to analyze the resulting parsetrees to get - the datatypes (and maybe atttypmod's) of the parameters, so - EXECUTE QUERY can convert the values into Datum's using the - types input functions. And the EXECUTE has to be handled - special in tcop (it's something between a regular query and - an utility statement). But it's not too hard to implement. - - Finally a - - FORGET QUERY identifier; - - (don't remember how the others named it) will remove the - prepared plan etc. simply by destroying the memory context - and dropping the identifier from the id->mcontext+prepareinfo - mapping. 
- - This all restricts the usage of PREPARE to optimizable - statements. Is it required to be able to prepare utility - statements (like CREATE TABLE or so) too? - - -Jan - --- - -#======================================================================# -# It's easier to get forgiveness for being wrong than for being right. # -# Let's break this rule - forgive me. # -#======================================== jwieck@debis.com (Jan Wieck) # - - - - -From pgsql-hackers-owner+M67@postgresql.org Tue Oct 31 19:18:16 2000 -Received: from mail.postgresql.org ([216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA08916 - for ; Tue, 31 Oct 2000 19:18:15 -0500 (EST) -Received: from mail.postgresql.org ([216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA10IOl60635; - Tue, 31 Oct 2000 19:18:24 -0500 (EST) - (envelope-from pgsql-hackers-owner+M67@postgresql.org) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA10H8l60400 - for ; Tue, 31 Oct 2000 19:17:08 -0500 (EST) - (envelope-from zakkr@zf.jcu.cz) -Received: from localhost (zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id BAA32036; - Wed, 1 Nov 2000 01:16:42 +0100 -Date: Wed, 1 Nov 2000 01:16:42 +0100 (CET) -From: Karel Zak -To: Alfred Perlstein -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Query cache import? -In-Reply-To: <20001031151144.F22110@fw.wintelcom.net> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -On Tue, 31 Oct 2000, Alfred Perlstein wrote: - -> I never saw much traffic regarding Karel's work on making stored -> proceedures: -> -> http://people.freebsd.org/~alfred/karel-pgsql.txt -> -> What happened with this? It looked pretty interesting. :( - - It's probably a little about me :-) ... 
well, - - My query cache is in usable state and it's efficient for all -things those motivate me to work on this. - - some basic features: - - - share parsed plans between backends in shared memory - - store plans to private backend hash table - - use parameters for stored queries - - better design for SPI - - memory usage for saved plans - - save plans "by key" - - - The current query cache code depend on 7.1 memory management. After -official 7.1 release I prepare patch with query cache+SPI (if not -hit me over head, please ..) - - All what will doing next time not depend on me, *it's on code developers*. - - For example Jan has interesting idea about caching all plans which -processing backend. But it's far future and IMHO we must go by small -steps to Oracle's funeral :-) - - If I need the query cache in the my work (typical for some web+pgsql) or -will some public interest I will continue on this, if not I freeze it. -(Exists more interesting work like http://mape.jcu.cz ... sorry of -advertising :-) - - Karel - - - - - - -From pgsql-hackers-owner+M312@postgresql.org Mon Nov 6 03:27:32 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA28404 - for ; Mon, 6 Nov 2000 03:27:32 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA68Pos51966; - Mon, 6 Nov 2000 03:25:50 -0500 (EST) - (envelope-from pgsql-hackers-owner+M312@postgresql.org) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA68Fes50414 - for ; Mon, 6 Nov 2000 03:15:40 -0500 (EST) - (envelope-from zakkr@zf.jcu.cz) -Received: from localhost (zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id JAA20862; - Mon, 6 Nov 2000 09:15:04 +0100 -Date: Mon, 6 Nov 2000 09:15:04 +0100 (CET) -From: Karel Zak -To: Christof Petig -cc: Zeugswetter Andreas 
SB , - The Hermit Hacker , pgsql-hackers@postgresql.org -Subject: Re: AW: [HACKERS] Re: [GENERAL] Query caching -In-Reply-To: <3A02DDFF.E8CBFCF3@wtal.de> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -On Fri, 3 Nov 2000, Christof Petig wrote: - -> Karel Zak wrote: -> -> > On Thu, 2 Nov 2000, Zeugswetter Andreas SB wrote: -> > -> > > -> > > > Well I can re-write and resubmit this patch. Add it as a -> > > > compile time option -> > > > is not bad idea. Second possibility is distribute it as patch -> > > > in the contrib -> > > > tree. And if it until not good tested not dirty with this main tree... -> > > > -> > > > Ok, I next week prepare it... -> > > -> > > One thing that worries me though is, that it extends the sql language, -> > > and there has been no discussion about the chosen syntax. -> > > -> > > Imho the standard embedded SQL syntax (prepare ...) could be a -> > > starting point. -> > -> > Yes, you are right... my PREPARE/EXECUTE is not too much ready to SQL92, -> > I some old letter I speculate about "SAVE/EXECUTE PLAN" instead -> > PREPARE/EXECUTE. But don't forget, it will *experimental* patch... we can -> > change it in future ..etc. -> > -> > Karel -> -> [Sorry, I didn't look into your patch, yet.] - - Please, read my old query cache and PREPARE/EXECUTE description... - -> What about parameters? Normally you can prepare a statement and execute it - - We have in PG parameters, see SPI, but now it's used inside backend only -and not exist statement that allows to use this feature in be<->fe. - -> using different parameters. AFAIK postgres' frontend-backend protocol is not -> designed to take parameters for statements (e.g. like result presents -> results). A very long road to go. -> By the way, I'm somewhat interested in getting this feature in. Perhaps it -> should be part of a protocol redesign (e.g. binary parameters/results). 
-> Handling endianness is one aspect, floats are harder (but float->ascii->float -> sometimes fails as well). - - PREPARE AS - [ USING type, ... typeN ] - [ NOSHARE | SHARE | GLOBAL ] - - EXECUTE - [ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ] - [ USING val, ... valN ] - [ NOSHARE | SHARE | GLOBAL ] - - DEALLOCATE PREPARE - [ [ NOSHARE | SHARE | GLOBAL ]] - [ ALL | ALL INTERNAL ] - - -An example: - - -PREPARE chris_query AS SELECT * FROM pg_class WHERE relname = $1 USING text; - -EXECUTE chris_query USING 'pg_shadow'; - - - Or mean you something other? - Karel - - - - - - -From pgsql-hackers-owner+M444@postgresql.org Thu Nov 9 03:32:10 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA09953 - for ; Thu, 9 Nov 2000 03:32:09 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eA98RSs11426; - Thu, 9 Nov 2000 03:27:28 -0500 (EST) - (envelope-from pgsql-hackers-owner+M444@postgresql.org) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eA98OPs11045; - Thu, 9 Nov 2000 03:24:25 -0500 (EST) - (envelope-from zakkr@zf.jcu.cz) -Received: from localhost (zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) with SMTP id JAA08951; - Thu, 9 Nov 2000 09:23:41 +0100 -Date: Thu, 9 Nov 2000 09:23:41 +0100 (CET) -From: Karel Zak -To: Christof Petig -cc: PostgreSQL Hackers , - Michael Meskes , - Zeugswetter Andreas SB , - The Hermit Hacker -Subject: Re: AW: [HACKERS] Re: [GENERAL] Query caching -In-Reply-To: <3A096BCE.F9887955@wtal.de> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -On Wed, 8 Nov 2000, Christof Petig wrote: - -> Karel Zak wrote: -> -> > > What about parameters? 
Normally you can prepare a statement and execute it -> > -> > We have in PG parameters, see SPI, but now it's used inside backend only -> > and not exist statement that allows to use this feature in be<->fe. -> -> Sad. Since ecpg would certainly benefit from this. -> -> > > using different parameters. AFAIK postgres' frontend-backend protocol is not -> > > designed to take parameters for statements (e.g. like result presents -> > > results). A very long road to go. -> > > By the way, I'm somewhat interested in getting this feature in. Perhaps it -> > > should be part of a protocol redesign (e.g. binary parameters/results). -> > > Handling endianness is one aspect, floats are harder (but float->ascii->float -> > > sometimes fails as well). -> > -> > PREPARE AS -> > [ USING type, ... typeN ] -> > [ NOSHARE | SHARE | GLOBAL ] -> > -> > EXECUTE -> > [ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ] -> > [ USING val, ... valN ] -> > [ NOSHARE | SHARE | GLOBAL ] -> > -> > DEALLOCATE PREPARE -> > [ [ NOSHARE | SHARE | GLOBAL ]] -> > [ ALL | ALL INTERNAL ] -> > -> > An example: -> > -> > PREPARE chris_query AS SELECT * FROM pg_class WHERE relname = $1 USING text; -> -> I would prefer '?' as a parameter name, since this is in the embedded sql standard -> (do you have a copy of the 94 draft? I can mail mine to you?) - - This not depend on query cache. The '$n' is PostgreSQL query parametr -keyword and is defined in standard parser. The PREPARE statement not parsing -query it's job for standard parser. - -> Also the standard says a whole lot about guessing the parameter's type. -> -> Also I vote for ?::type or type(?) or sql's cast(...) (don't know it's syntax) -> instead of abusing the using keyword. - -The postgresql executor expect types of parametrs in separate input (array). -I not sure how much expensive/executable is survey it from query. - -> > EXECUTE chris_query USING 'pg_shadow'; -> -> Great idea of yours to implement this! 
Since I was thinking about implementing a -> more decent schema for ecpg but had no mind to touch the backend and be-fe -> protocol (yet). -> It would be desirable to do an 'execute immediate using', since using input -> parameters would take a lot of code away from ecpg. - -By the way, PREPARE/EXECUTE is face only. More interesting in this period is -query-cache-kernel. SQL92 is really a little unlike my PREPARE/EXECUTE. - - Karel - - -From pgsql-hackers-owner+M9563@postgresql.org Thu May 31 16:31:59 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f4VKVxc26942 - for ; Thu, 31 May 2001 16:31:59 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f4VKVIE38645; - Thu, 31 May 2001 16:31:18 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9563@postgresql.org) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f4VKNVE35356 - for ; Thu, 31 May 2001 16:23:31 -0400 (EDT) - (envelope-from zakkr@zf.jcu.cz) -Received: (from zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian 8.9.3-21) id WAA19957; - Thu, 31 May 2001 22:23:26 +0200 -Date: Thu, 31 May 2001 22:23:26 +0200 -From: Karel Zak -To: Roberto Abalde -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Cache for query plans -Message-ID: <20010531222326.B16862@ara.zf.jcu.cz> -References: <000701c0e932$d17646c0$c6023dc8@ultra> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -User-Agent: Mutt/1.0.1i -In-Reply-To: <000701c0e932$d17646c0$c6023dc8@ultra>; from roberto.abalde@galego21.org on Wed, May 30, 2001 at 03:00:53PM -0300 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -On Wed, May 30, 2001 at 03:00:53PM -0300, Roberto Abalde wrote: -> Hi, -> -> I need to implement a cache for query plans as part of my BSc thesis. 
Does -> anybody know what happened to Karel Zak's patch? -> - - - Hi, - - - my patch is on my ftp and nobody works on it, but I mean it's good -begin for some next work. I not sure with implement this experimental -patch (but usable) to official sources. For example Jan has more complex -idea about query plan cache ... but first time we must solve some -sub-problems like memory management in shared memory that is transparently -for starndard routines like copy query plan ... and Tom isn't sure with -query cache in shared memory...etc. Too much queries, but less answers :-) - - - Karel -> -> PS: Sorry for my english :( - - - Do you anytime read any my mail :-) - - - Karel - - --- - Karel Zak - http://home.zf.jcu.cz/~zakkr/ - - C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M21218@postgresql.org Fri Apr 12 04:52:19 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3C8qIS25666 - for ; Fri, 12 Apr 2002 04:52:18 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id AE2FA4769F1; Fri, 12 Apr 2002 03:54:34 -0400 (EDT) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by postgresql.org (Postfix) with ESMTP id A05A94769DC - for ; Fri, 12 Apr 2002 03:51:27 -0400 (EDT) -Received: from ara.zf.jcu.cz (LOCALHOST [127.0.0.1]) - by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) with ESMTP id g3C7pHBK012031; - Fri, 12 Apr 2002 09:51:17 +0200 -Received: (from zakkr@localhost) - by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) id g3C7pGum012030; - Fri, 12 Apr 2002 09:51:16 +0200 -Date: Fri, 12 Apr 2002 09:51:16 +0200 -From: Karel Zak -To: pgsql-hackers@postgresql.org -cc: 
Hiroshi Inoue -Subject: Re: [HACKERS] 7.3 schedule -Message-ID: <20020412095116.B6370@zf.jcu.cz> -References: <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -In-Reply-To: <20020412004134.5d35a2dd.nconway@klamath.dyndns.org>; from nconway@klamath.dyndns.org on Fri, Apr 12, 2002 at 12:41:34AM -0400 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Fri, Apr 12, 2002 at 12:41:34AM -0400, Neil Conway wrote: -> On Fri, 12 Apr 2002 12:58:01 +0900 -> "Hiroshi Inoue" wrote: -> > -> > Just a confirmation. -> > Someone is working on PREPARE/EXECUTE ? -> > What about Karel's work ? - - Right question :-) - -> I am. My work is based on Karel's stuff -- at the moment I'm still -> basically working on getting Karel's patch to play nicely with -> current sources; once that's done I'll be addressing whatever -> issues are stopping the code from getting into CVS. - - My patch (qcache) for PostgreSQL 7.0 is available at - ftp://ftp2.zf.jcu.cz/users/zakkr/pg/. - - I very look forward to Neil's work on this. - - Notes: - - * It's experimental patch, but usable. All features below mentioned - works. - - * PREPARE/EXECUTE is not only SQL statements, I think good idea is - create something common and robus for query-plan caching, - beacuse there is for example SPI too. The RI triggers are based - on SPI_saveplan(). 
- - * My patch knows EXECUTE INTO feature: - - PREPARE foo AS SELECT * FROM pg_class WHERE relname ~~ $1 USING text; - - EXECUTE foo USING 'pg%'; <-- standard select - - EXECUTE foo INTO TEMP newtab USING 'pg%'; <-- select into - - - * The patch allows store query-planns to shared memory and is - possible EXECUTE it at more backends (over same DB) and planns - are persistent across connetions. For this feature I create special - memory context subsystem (like current aset.c, but it works with - IPC shared memory). - - This is maybe too complex solution and (maybe) sufficient is cache - query in one backend only. I know unbelief about this shared - memory solution (Tom?). - - - Karel - - - My experimental patch README (excuse my English): - - Implementation - ~~~~~~~~~~~~~~ - - The qCache allows save queryTree and queryPlan. There is available are - two space for data caching. - - LOCAL - data are cached in backend non-shared memory and data aren't - available in other backends. - - SHARE - data are cached in backend shared memory and data are - visible in all backends. - - Because size of share memory pool is limited and it is set during - postmaster start up, the qCache must remove all old planns if pool is - full. You can mark each entry as "REMOVEABLE" or "NOTREMOVEABLE". - - A removeable entry is removed if pool is full. - - A not-removeable entry must be removed via qCache_Remove() or - the other routines. The qCache not remove this entry itself. - - All records in qCache are cached (in the hash table) under some key. - The qCache knows two alternate of key --- "KEY_STRING" and "KEY_BINARY". - - The qCache API not allows access to shared memory, all cached planns that - API returns are copy to CurrentMemoryContext. All (qCache_ ) routines lock - shmem itself (exception is qCache_RemoveOldest_ShareRemoveAble()). - - - for locking is used spin lock. 
- - Memory management - ~~~~~~~~~~~~~~~~~ - The qCache use for qCache's shared pool its memory context independent on - standard aset/mcxt, but use compatible API --- it allows to use standard - palloc() (it is very needful for basic plan-tree operations, an example - for copyObject()). The qCache memory management is very simular to current - aset.c code. It is chunk-ed blocks too, but the block is smaller - 1024b. - - The number of blocks is available set in postmaster 'argv' via option - '-Z'. - - For plan storing is used separate MemoryContext for each plan, it - is good idea (Hiroshi's ?), bucause create new context is simple and - inexpensive and allows easy destroy (free) cached plan. This method is - used in my SPI overhaul instead TopMemoryContext feeding. - - Postmaster - ~~~~~~~~~~ - The query cache memory is init during potmaster startup. The size of - query cache pool is set via '-Z ' switch --- default - is 100 blocks where 1 block = 1024b, it is sufficient for 20-30 cached - planns. One query needs somewhere 3-10 blocks, for example query like - - PREPARE sel AS SELECT * FROM pg_class; - - needs 10Kb, because table pg_class has very much columns. - - Note: for development I add SQL function: "SELECT qcache_state();", - this routine show usage of qCache. - - SPI - ~~~ - I a little overwrite SPI save plan method and remove TopMemoryContext - "feeding". - - Standard SPI: - - SPI_saveplan() - save each plan to separate standard memory context. - - SPI_freeplan() - free plan. - - By key SPI: - - It is SPI interface for query cache and allows save planns to SHARED - or LOCAL cache 'by' arbitrary key (string or binary). 
Routines:
- 
-	SPI_saveplan_bykey()	- save a plan to the query cache
- 
-	SPI_freeplan_bykey()	- remove a plan from the query cache
- 
-	SPI_fetchplan_bykey()	- fetch a plan saved in the query cache
- 
-	SPI_execp_bykey()	- execute (via SPI) a plan saved in the query
-				  cache
- 
-  Now users can write functions that save plans to shared memory,
-  and the plans are visible in all backends and persistent across
-  connections.
- 
-  Example:
-  ~~~~~~~
-	/* ----------
-	 * Save/exec query from shared cache via string key
-	 * ----------
-	 */
-	int	keySize = 0;
-	int	flag = SPI_BYKEY_SHARE | SPI_BYKEY_STRING;
-	char	*key = "my unique key";
- 
-	res = SPI_execp_bykey(values, nulls, tcount, key, flag, keySize);
- 
-	if (res == SPI_ERROR_PLANNOTFOUND)
-	{
-		/* --- no plan in the cache - must create it --- */
- 
-		void	*plan;
- 
-		plan = SPI_prepare(querystr, valnum, valtypes);
-		SPI_saveplan_bykey(plan, key, keySize, flag);
- 
-		res = SPI_execp(plan, values, nulls, tcount);
-	}
- 
-	elog(NOTICE, "Processed: %d", SPI_processed);
- 
- 
- PREPARE/EXECUTE
- ~~~~~~~~~~~~~~~
-  * Syntax:
- 
-	PREPARE AS 
-		[ USING type, ... typeN ]
-		[ NOSHARE | SHARE | GLOBAL ]
- 
-	EXECUTE 
-		[ INTO [ TEMPORARY | TEMP ] [ TABLE ] new_table ]
-		[ USING val, ... valN ]
-		[ NOSHARE | SHARE | GLOBAL ]
- 
-	DEALLOCATE PREPARE
-		[ [ NOSHARE | SHARE | GLOBAL ]]
-		[ ALL | ALL INTERNAL ]
- 
- 
-  I know this is a little outside SQL92... (use CREATE/DROP PLAN instead
-  of this?) --- what do the SQL standard gurus think?
- 
-  * Where:
- 
-	NOSHARE	--- cached in the local backend query cache; not accessible
-		    from other backends and not persistent across
-		    connections.
- 
-	SHARE	--- cached in the shared query cache and accessible from
-		    all backends which work over the same database.
- 
-	GLOBAL	--- cached in the shared query cache and accessible from
-		    all backends and all databases.
- 
-	The default is 'SHARE'.
- 
-	Deallocate:
- 
-	ALL		--- deallocate all the user's plans.
- 
-	ALL INTERNAL	--- deallocate all internal plans, such as plans
-			    cached via SPI.
It is needful if the user
-			    alters/drops a table, etc.
- 
-  * Parameters:
- 
-	The "USING" part of the PREPARE statement sets the data types for
-	the parameters in the query. For example:
- 
-	PREPARE sel AS SELECT * FROM pg_class WHERE relname ~~ $1 USING text;
- 
-	EXECUTE sel USING 'pg%';
- 
- 
-  * Limitations:
- 
-	- PREPARE/EXECUTE allows full SELECT/INSERT/DELETE/
-	  UPDATE statements.
-	- unions, subselects, limit, offset, and select-into are all usable.
- 
- 
-  Performance:
-  ~~~~~~~~~~~
-  * the SPI
- 
-	- For my tests I slightly changed the RI triggers to use the SPI
-	  by-key API and save plans to the shared qCache instead of the
-	  internal RI hash table.
- 
-	  The RI triggers use queries that are very simple for the parser, so
-	  the qCache gain is not visible there. It helps more if backends
-	  start up very often and the RI checks always hit the same tables; in
-	  that situation speed goes up by 10-12%.
-	  (This snapshot does not include this RI change.)
- 
-	  But it all depends on how complicated the query in the trigger is
-	  for the parser.
- 
-  * PREPARE/EXECUTE
- 
-	- For tests I used a query that touches no table (the executor is
-	  idle) but is difficult for the parser. An example:
- 
-	  SELECT 'a text ' || (10*10+(100^2))::text || ' next text ' || cast
-	  (date_part('year', timestamp 'now') AS text );
- 
-	  (10000 * this query):
- 
-		standard select:	54 sec
-		via prepare/execute:	 4 sec	(93% better)
- 
-	  IMHO that is not bad.
- 
-	- A standard query like:
- 
-	  SELECT u.usename, r.relname FROM pg_class r, pg_user u WHERE
-	  r.relowner = u.usesysid;
- 
-	  is 10-20% faster with PREPARE/EXECUTE.
- --- - Karel Zak - http://home.zf.jcu.cz/~zakkr/ - - C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M21228@postgresql.org Fri Apr 12 10:15:34 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CEFXS29835 - for ; Fri, 12 Apr 2002 10:15:33 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 7BFE1475A55; Fri, 12 Apr 2002 10:15:27 -0400 (EDT) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id 5659B474E71 - for ; Fri, 12 Apr 2002 10:14:31 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CEEQF27238; - Fri, 12 Apr 2002 10:14:26 -0400 (EDT) -To: Karel Zak -cc: pgsql-hackers@postgresql.org, Neil Conway -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <20020412095116.B6370@zf.jcu.cz> -References: <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> -Comments: In-reply-to Karel Zak - message dated "Fri, 12 Apr 2002 09:51:16 +0200" -Date: Fri, 12 Apr 2002 10:14:26 -0400 -Message-ID: <27235.1018620866@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -Karel Zak writes: -> * The patch allows store query-planns to shared memory and is -> possible EXECUTE it at more backends (over same DB) and planns -> are persistent across connetions. For this feature I create special -> memory context subsystem (like current aset.c, but it works with -> IPC shared memory). 
-> This is maybe too complex solution and (maybe) sufficient is cache -> query in one backend only. I know unbelief about this shared -> memory solution (Tom?). - -Yes, that is the part that was my sticking point last time around. -(1) Because shared memory cannot be extended on-the-fly, I think it is -a very bad idea to put data structures in there without some well -thought out way of predicting/limiting their size. (2) How the heck do -you get rid of obsoleted cached plans, if the things stick around in -shared memory even after you start a new backend? (3) A shared cache -requires locking; contention among multiple backends to access that -shared resource could negate whatever performance benefit you might hope -to realize from it. - -A per-backend cache kept in local memory avoids all of these problems, -and I have seen no numbers to make me think that a shared plan cache -would achieve significantly more performance benefit than a local one. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M21233@postgresql.org Fri Apr 12 12:26:32 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGQVS11018 - for ; Fri, 12 Apr 2002 12:26:31 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 38DBB475B20; Fri, 12 Apr 2002 12:22:08 -0400 (EDT) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 0DA70475B9E - for ; Fri, 12 Apr 2002 12:21:15 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g3CGL4310492; - Fri, 12 Apr 2002 12:21:04 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200204121621.g3CGL4310492@candle.pha.pa.us> -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <27235.1018620866@sss.pgh.pa.us> -To: Tom Lane -Date: Fri, 12 Apr 2002 12:21:04 -0400 (EDT) -cc: Karel Zak , pgsql-hackers@postgresql.org, - Neil Conway -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Tom Lane wrote: -> Karel Zak writes: -> > * The patch allows store query-planns to shared memory and is -> > possible EXECUTE it at more backends (over same DB) and planns -> > are persistent across connetions. For this feature I create special -> > memory context subsystem (like current aset.c, but it works with -> > IPC shared memory). -> > This is maybe too complex solution and (maybe) sufficient is cache -> > query in one backend only. I know unbelief about this shared -> > memory solution (Tom?). -> -> Yes, that is the part that was my sticking point last time around. 
-> (1) Because shared memory cannot be extended on-the-fly, I think it is -> a very bad idea to put data structures in there without some well -> thought out way of predicting/limiting their size. (2) How the heck do -> you get rid of obsoleted cached plans, if the things stick around in -> shared memory even after you start a new backend? (3) A shared cache -> requires locking; contention among multiple backends to access that -> shared resource could negate whatever performance benefit you might hope -> to realize from it. -> -> A per-backend cache kept in local memory avoids all of these problems, -> and I have seen no numbers to make me think that a shared plan cache -> would achieve significantly more performance benefit than a local one. - -Certainly a shared cache would be good for apps that connect to issue a -single query frequently. In such cases, there would be no local cache -to use. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M21234@postgresql.org Fri Apr 12 12:44:12 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGiBS12385 - for ; Fri, 12 Apr 2002 12:44:12 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id AEAA7475C6C; Fri, 12 Apr 2002 12:43:17 -0400 (EDT) -Received: from barry.xythos.com (h-64-105-36-191.SNVACAID.covad.net [64.105.36.191]) - by postgresql.org (Postfix) with ESMTP id CE58C47598E - for ; Fri, 12 Apr 2002 12:42:48 -0400 (EDT) -Received: from xythos.com (localhost.localdomain [127.0.0.1]) - by barry.xythos.com (8.11.6/8.11.6) with ESMTP id g3CGgaI02920; - Fri, 12 Apr 2002 09:42:36 -0700 -Message-ID: <3CB70E7C.3090801@xythos.com> -Date: Fri, 12 Apr 2002 09:42:36 -0700 -From: Barry Lind -User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.9) Gecko/20020310 -X-Accept-Language: en-us, en -MIME-Version: 1.0 -To: Tom Lane -cc: Karel Zak , pgsql-hackers@postgresql.org, - Neil Conway -Subject: Re: [HACKERS] 7.3 schedule -References: <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us> -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - - - -Tom Lane wrote: -> Yes, that is the part that was my sticking point last time around. 
-> (1) Because shared memory cannot be extended on-the-fly, I think it is
-> a very bad idea to put data structures in there without some well
-> thought out way of predicting/limiting their size.  (2) How the heck do
-> you get rid of obsoleted cached plans, if the things stick around in
-> shared memory even after you start a new backend?  (3) A shared cache
-> requires locking; contention among multiple backends to access that
-> shared resource could negate whatever performance benefit you might hope
-> to realize from it.
-> 
-> A per-backend cache kept in local memory avoids all of these problems,
-> and I have seen no numbers to make me think that a shared plan cache
-> would achieve significantly more performance benefit than a local one.
-> 
- 
-Oracle's implementation is a shared cache for all plans.  This was 
-introduced in Oracle 6 or 7 (I don't remember which anymore).  The net 
-effect was that in general there was a significant performance 
-improvement with the shared cache.  However, poorly written apps can now 
-bring the Oracle database to its knees because of the locking issues 
-associated with the shared cache.  For example, if the most frequently 
-run sql statements are coded poorly (i.e. they don't use bind variables, 
-e.g. 'select bar from foo where foobar = $1' vs. 'select bar from foo 
-where foobar = ' || somevalue (where somevalue is likely to be 
-different on every call)), the shared cache doesn't help and its overhead 
-becomes significant.
- -thanks, ---Barry - - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M21237@postgresql.org Fri Apr 12 12:50:28 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGoRS13005 - for ; Fri, 12 Apr 2002 12:50:28 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 32A28475BA1; Fri, 12 Apr 2002 12:50:15 -0400 (EDT) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 07F1E475892 - for ; Fri, 12 Apr 2002 12:49:43 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g3CGnbw12950; - Fri, 12 Apr 2002 12:49:37 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200204121649.g3CGnbw12950@candle.pha.pa.us> -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <3CB70E7C.3090801@xythos.com> -To: Barry Lind -Date: Fri, 12 Apr 2002 12:49:37 -0400 (EDT) -cc: Tom Lane , Karel Zak , - pgsql-hackers@postgresql.org, Neil Conway -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Barry Lind wrote: -> Oracle's implementation is a shared cache for all plans. This was -> introduced in Oracle 6 or 7 (I don't remember which anymore). The net -> effect was that in general there was a significant performance -> improvement with the shared cache. However poorly written apps can now -> bring the Oracle database to its knees because of the locking issues -> associated with the shared cache. 
For example, if the most frequently
-> run sql statements are coded poorly (i.e. they don't use bind variables,
-> e.g. 'select bar from foo where foobar = $1' vs. 'select bar from foo
-> where foobar = ' || somevalue (where somevalue is likely to be
-> different on every call)), the shared cache doesn't help and its overhead
-> becomes significant.
- 
-This is very interesting.  We have always been concerned that shared
-cache invalidation could cause more of a performance problem than the
-shared cache gives benefit, and it sounds like you are saying exactly
-that.
- 
--- 
-  Bruce Momjian                        |  http://candle.pha.pa.us
-  pgman@candle.pha.pa.us               |  (610) 853-3000
-  +  If your life is a hard drive,     |  830 Blythe Avenue
-  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
- 
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
-    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
- 
-From pgsql-hackers-owner+M21238@postgresql.org Fri Apr 12 12:51:55 2002
-Return-path: 
-Received: from postgresql.org (postgresql.org [64.49.215.8])
-	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CGptS13119
-	for ; Fri, 12 Apr 2002 12:51:55 -0400 (EDT)
-Received: from postgresql.org (postgresql.org [64.49.215.8])
-	by postgresql.org (Postfix) with SMTP
-	id C599D475BC6; Fri, 12 Apr 2002 12:51:47 -0400 (EDT)
-Received: from sss.pgh.pa.us (unknown [192.204.191.242])
-	by postgresql.org (Postfix) with ESMTP id C9F94475892
-	for ; Fri, 12 Apr 2002 12:51:26 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
-	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CGpQF27967;
-	Fri, 12 Apr 2002 12:51:27 -0400 (EDT)
-To: Bruce Momjian 
-cc: Karel Zak , pgsql-hackers@postgresql.org,
-	Neil Conway 
-Subject: Re: [HACKERS] 7.3 schedule 
-In-Reply-To: <200204121621.g3CGL4310492@candle.pha.pa.us> 
-References: <200204121621.g3CGL4310492@candle.pha.pa.us> 
-Comments: In-reply-to
Bruce Momjian - message dated "Fri, 12 Apr 2002 12:21:04 -0400" -Date: Fri, 12 Apr 2002 12:51:26 -0400 -Message-ID: <27964.1018630286@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Bruce Momjian writes: -> Certainly a shared cache would be good for apps that connect to issue a -> single query frequently. In such cases, there would be no local cache -> to use. - -We have enough other problems with the single-query-per-connection -scenario that I see no reason to believe that a shared plan cache will -help materially. The correct answer for those folks will *always* be -to find a way to reuse the connection. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M21241@postgresql.org Fri Apr 12 16:25:46 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CKPkS03078 - for ; Fri, 12 Apr 2002 16:25:46 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 9C3BD475CC6; Fri, 12 Apr 2002 16:25:42 -0400 (EDT) -Received: from klamath.dyndns.org (CPE002078144ae0.cpe.net.cable.rogers.com [24.102.202.35]) - by postgresql.org (Postfix) with ESMTP id B06D8475909 - for ; Fri, 12 Apr 2002 16:24:52 -0400 (EDT) -Received: from jiro (jiro [192.168.40.7]) - by klamath.dyndns.org (Postfix) with SMTP - id C05557013; Fri, 12 Apr 2002 16:24:53 -0400 (EDT) -Date: Fri, 12 Apr 2002 16:24:48 -0400 -From: Neil Conway -To: "Bruce Momjian" -cc: tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz, pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] 7.3 schedule -Message-ID: <20020412162448.4d46d747.nconway@klamath.dyndns.org> -In-Reply-To: <200204121621.g3CGL4310492@candle.pha.pa.us> -References: <27235.1018620866@sss.pgh.pa.us> - <200204121621.g3CGL4310492@candle.pha.pa.us> -X-Mailer: Sylpheed 
version 0.7.4 (GTK+ 1.2.10; i386-debian-linux-gnu) -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -On Fri, 12 Apr 2002 12:21:04 -0400 (EDT) -"Bruce Momjian" wrote: -> Tom Lane wrote: -> > A per-backend cache kept in local memory avoids all of these problems, -> > and I have seen no numbers to make me think that a shared plan cache -> > would achieve significantly more performance benefit than a local one. -> -> Certainly a shared cache would be good for apps that connect to issue a -> single query frequently. In such cases, there would be no local cache -> to use. - -One problem with this kind of scenario is: what to do if the plan no -longer exists for some reason? (e.g. the code that was supposed to be -PREPARE-ing your statements failed to execute properly, or the cached -plan has been evicted from shared memory, or the database was restarted, -etc.) -- EXECUTE in and of itself won't have enough information to do -anything useful. We could perhaps provide a means for an application -to test for the existence of a cached plan (in which case the -application developer will need to add logic to their application -to re-prepare the query if necessary, which could get complicated). - -Cheers, - -Neil - --- -Neil Conway -PGP Key ID: DB3C29FC - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M21242@postgresql.org Fri Apr 12 17:27:24 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLRNS14410 - for ; Fri, 12 Apr 2002 17:27:23 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id E05A1475D30; Fri, 12 Apr 2002 17:26:40 -0400 (EDT) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 36BBB475858 - for ; Fri, 12 Apr 2002 17:25:44 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g3CLPVa14231; - Fri, 12 Apr 2002 17:25:31 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200204122125.g3CLPVa14231@candle.pha.pa.us> -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <20020412162448.4d46d747.nconway@klamath.dyndns.org> -To: Neil Conway -Date: Fri, 12 Apr 2002 17:25:31 -0400 (EDT) -cc: tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz, pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Neil Conway wrote: -> On Fri, 12 Apr 2002 12:21:04 -0400 (EDT) -> "Bruce Momjian" wrote: -> > Tom Lane wrote: -> > > A per-backend cache kept in local memory avoids all of these problems, -> > > and I have seen no numbers to make me think that a shared plan cache -> > > would achieve significantly more performance benefit than a local one. -> > -> > Certainly a shared cache would be good for apps that connect to issue a -> > single query frequently. In such cases, there would be no local cache -> > to use. -> -> One problem with this kind of scenario is: what to do if the plan no -> longer exists for some reason? (e.g. 
the code that was supposed to be
-> PREPARE-ing your statements failed to execute properly, or the cached
-> plan has been evicted from shared memory, or the database was restarted,
-> etc.) -- EXECUTE in and of itself won't have enough information to do
-> anything useful. We could perhaps provide a means for an application
-> to test for the existence of a cached plan (in which case the
-> application developer will need to add logic to their application
-> to re-prepare the query if necessary, which could get complicated).
- 
-Oh, are you thinking that one backend would do the PREPARE and another
-one the EXECUTE?  I can't see that working at all.  I thought there
-would be some way to quickly test if the submitted query was in the cache,
-but maybe that is too much of a performance penalty to be worth it.
- 
--- 
-  Bruce Momjian                        |  http://candle.pha.pa.us
-  pgman@candle.pha.pa.us               |  (610) 853-3000
-  +  If your life is a hard drive,     |  830 Blythe Avenue
-  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
- 
----------------------------(end of broadcast)---------------------------
-TIP 3: if posting/reading through Usenet, please send an appropriate
-subscribe-nomail command to majordomo@postgresql.org so that your
-message can get through to the mailing list cleanly
- 
-From tgl@sss.pgh.pa.us Fri Apr 12 17:36:17 2002
-Return-path: 
-Received: from sss.pgh.pa.us (root@[192.204.191.242])
-	by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLaGS16061
-	for ; Fri, 12 Apr 2002 17:36:17 -0400 (EDT)
-Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1])
-	by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3CLaGF10813;
-	Fri, 12 Apr 2002 17:36:16 -0400 (EDT)
-To: Bruce Momjian 
-cc: Neil Conway , zakkr@zf.jcu.cz,
-	pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] 7.3 schedule 
-In-Reply-To: <200204122125.g3CLPVa14231@candle.pha.pa.us> 
-References: <200204122125.g3CLPVa14231@candle.pha.pa.us> 
-Comments: In-reply-to Bruce Momjian 
-	message dated "Fri, 12 Apr 2002
17:25:31 -0400" -Date: Fri, 12 Apr 2002 17:36:16 -0400 -Message-ID: <10810.1018647376@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Bruce Momjian writes: -> Oh, are you thinking that one backend would do the PREPARE and another -> one the EXECUTE? I can't see that working at all. - -Uh, why exactly were you advocating a shared cache then? Wouldn't that -be exactly the *point* of a shared cache? - - regards, tom lane - -From pgsql-hackers-owner+M21245@postgresql.org Fri Apr 12 17:39:13 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLdCS16515 - for ; Fri, 12 Apr 2002 17:39:12 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id A904B475E15; Fri, 12 Apr 2002 17:39:09 -0400 (EDT) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id B1A3F4758DE - for ; Fri, 12 Apr 2002 17:38:25 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g3CLcFX16347; - Fri, 12 Apr 2002 17:38:15 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200204122138.g3CLcFX16347@candle.pha.pa.us> -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <10810.1018647376@sss.pgh.pa.us> -To: Tom Lane -Date: Fri, 12 Apr 2002 17:38:15 -0400 (EDT) -cc: Neil Conway , zakkr@zf.jcu.cz, - pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Tom Lane wrote: -> Bruce Momjian writes: -> > Oh, are you thinking that one backend would do the PREPARE and another -> > one the EXECUTE? I can't see that working at all. -> -> Uh, why exactly were you advocating a shared cache then? Wouldn't that -> be exactly the *point* of a shared cache? 
- -I thought it would somehow compare the SQL query string to the cached -plans and if it matched, it would use that plan rather than make a new -one. Any DDL statement would flush the cache. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M21246@postgresql.org Fri Apr 12 17:56:58 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3CLuvS19021 - for ; Fri, 12 Apr 2002 17:56:58 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 1B4D6475E2C; Fri, 12 Apr 2002 17:56:55 -0400 (EDT) -Received: from voyager.corporate.connx.com (unknown [209.20.248.131]) - by postgresql.org (Postfix) with ESMTP id 059F1475858 - for ; Fri, 12 Apr 2002 17:56:13 -0400 (EDT) -X-MimeOLE: Produced By Microsoft Exchange V6.0.4712.0 -content-class: urn:content-classes:message -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Subject: Re: [HACKERS] 7.3 schedule -Date: Fri, 12 Apr 2002 14:59:15 -0700 -Message-ID: -Thread-Topic: [HACKERS] 7.3 schedule -Thread-Index: AcHia2aODSpgXEd4Tluz/N0jN5fJOQAAC//w -From: "Dann Corbit" -To: "Bruce Momjian" , "Tom Lane" -cc: "Neil Conway" , , - -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id g3CLuvS19021 -Status: OR - ------Original Message----- -From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -Sent: Friday, April 12, 2002 2:38 PM -To: Tom Lane -Cc: Neil Conway; zakkr@zf.jcu.cz; pgsql-hackers@postgresql.org 
-Subject: Re: [HACKERS] 7.3 schedule
- 
- 
-Tom Lane wrote:
-> Bruce Momjian writes:
-> > Oh, are you thinking that one backend would do the PREPARE and
-another
-> > one the EXECUTE?  I can't see that working at all.
-> 
-> Uh, why exactly were you advocating a shared cache then?  Wouldn't
-that
-> be exactly the *point* of a shared cache?
- 
-I thought it would somehow compare the SQL query string to the cached
-plans and if it matched, it would use that plan rather than make a new
-one.  Any DDL statement would flush the cache.
->>-------------------------------------------------------------------
-Many applications will have similar queries coming from lots of
-different end-users.  Imagine an order-entry program where people are
-ordering parts.  Many of the queries might look like this:
- 
-SELECT part_number FROM parts WHERE part_id = 12324 AND part_cost
-< 12.95
- 
-In order to cache this query, we first parse it to replace the data
-fields with parameter markers.
-Then it looks like this:
-SELECT part_number FROM parts WHERE part_id = ? AND part_cost < ?
-{in the case of a 'LIKE' query or some other query where you can use
-key information, you might have a symbolic replacement like this:
-WHERE field LIKE '{D}%' to indicate that the key can be used}
-Then, we make sure that the case is consistent by either capitalizing
-the whole query or changing it all into lower case:
-select part_number from parts where part_id = ? and part_cost < ?
-Then, we run a checksum on the parameterized string.
-The checksum might be used as a hash table key, where we keep some
-additional information like how stale the entry is, and a pointer to
-the actual parameterized SQL (in case the hash key has a collision,
-it would be simply wrong to run an incorrect query for obvious enough
-reasons).
-Now, if there are a huge number of users of the same application, it
-makes sense that the probability of reusing queries goes up with
-the number of users of the same application.
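The parameterize-then-checksum pipeline described here can be sketched as follows. The literal-stripping rules and the FNV-1a checksum are illustrative assumptions; a real implementation would use the parser's token stream and whatever hash the cache prefers.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* Replace numeric and single-quoted literals with '?' and lowercase the
 * rest, so textually different queries with the same shape normalize to
 * the same string. */
static void normalize_query(const char *sql, char *out, size_t outlen)
{
    size_t o = 0;

    for (size_t i = 0; sql[i] != '\0' && o + 1 < outlen; i++) {
        if (sql[i] == '\'') {                    /* string literal */
            while (sql[i + 1] != '\0' && sql[i + 1] != '\'')
                i++;
            if (sql[i + 1] == '\'')
                i++;                             /* skip closing quote */
            out[o++] = '?';
        } else if (isdigit((unsigned char) sql[i]) &&
                   (o == 0 || !isalnum((unsigned char) out[o - 1]))) {
            while (isdigit((unsigned char) sql[i + 1]) || sql[i + 1] == '.')
                i++;                             /* numeric literal */
            out[o++] = '?';
        } else {
            out[o++] = tolower((unsigned char) sql[i]);
        }
    }
    out[o] = '\0';
}

/* FNV-1a: a simple checksum usable as the hash-table key. */
static uint64_t query_checksum(const char *normalized)
{
    uint64_t h = 1469598103934665603ULL;

    for (const char *p = normalized; *p; p++) {
        h ^= (unsigned char) *p;
        h *= 1099511628211ULL;
    }
    return h;
}
```

As the message notes, the checksum is only a key: on a hash hit the cache must still compare the stored parameterized SQL, since running the wrong plan on a collision would be plainly incorrect.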
Therefore, I would -advocate that the cache be kept in shared memory. - -Consider a single application with 100 different queries. Now, add -one user, ten users, 100 users, ... 10,000 users and you can see -that the benefit would be greater and greater as we add users. -<<------------------------------------------------------------------- - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M21270@postgresql.org Sat Apr 13 02:30:47 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3D6UkS07169 - for ; Sat, 13 Apr 2002 02:30:46 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 23FEC475D1E; Sat, 13 Apr 2002 02:30:38 -0400 (EDT) -Received: from mail.iinet.net.au (symphony-01.iinet.net.au [203.59.3.33]) - by postgresql.org (Postfix) with SMTP id A08A4475C6C - for ; Sat, 13 Apr 2002 02:29:37 -0400 (EDT) -Received: (qmail 11594 invoked by uid 666); 13 Apr 2002 06:29:36 -0000 -Received: from unknown (HELO SOL) (203.59.103.193) - by mail.iinet.net.au with SMTP; 13 Apr 2002 06:29:36 -0000 -Message-ID: <002301c1e2b3$804bd000$0200a8c0@SOL> -From: "Christopher Kings-Lynne" -To: "Barry Lind" , "Tom Lane" -cc: "Karel Zak" , , - "Neil Conway" -References: <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us> <3CB70E7C.3090801@xythos.com> -Subject: Re: [HACKERS] 7.3 schedule -Date: Sat, 13 Apr 2002 14:21:50 +0800 -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 -X-MSMail-Priority: Normal 
-X-Mailer: Microsoft Outlook Express 5.50.4522.1200 -X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -> > thought out way of predicting/limiting their size. (2) How the heck do -> > you get rid of obsoleted cached plans, if the things stick around in -> > shared memory even after you start a new backend? (3) A shared cache -> > requires locking; contention among multiple backends to access that -> > shared resource could negate whatever performance benefit you might hope -> > to realize from it. - -I don't understand all these locking problems? Surely the only lock a -transaction would need on a stored query is one that prevents the cache -invalidation mechanism from deleting it out from under it? Surely this -means that there would be tonnes of readers on the cache - none of them -blocking each other, and the odd invalidation event that needs a complete -lock? - -Also, as for invalidation, there probably could be just two reasons to -invalidate a query in the cache. (1) The cache is running out of space and -you use LRU or something to remove old queries, or (2) someone runs ANALYZE, -in which case all cached queries should just be flushed? If they specify an -actual table to analyze, then just drop all queries on the table. - -Could this cache mechanism be used to make views fast as well? You could -cache the queries that back views on first use, and then they can follow the -above rules for flushing... 
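-[Editorial sketch] The access pattern described here -- many concurrent
-readers, occasional exclusive invalidation -- can be modeled like this
-(hypothetical Python; `PlanCache` and its methods are invented for
-illustration, and a real shared-memory cache would need a proper
-reader/writer lock rather than a plain mutex):
-
-```python
-import threading
-
-class PlanCache:
-    """Reader-heavy cache: plain lookups never block each other;
-    only insertion and invalidation take the exclusive lock."""
-    def __init__(self):
-        self._lock = threading.Lock()   # held by writers/invalidators only
-        self._plans = {}                # key -> (referenced tables, plan)
-
-    def get(self, key):
-        entry = self._plans.get(key)    # lock-free read
-        return entry[1] if entry else None
-
-    def put(self, key, tables, plan):
-        with self._lock:
-            self._plans[key] = (frozenset(tables), plan)
-
-    def invalidate_table(self, table):
-        # e.g. "ANALYZE parts": drop every plan that touches that table.
-        with self._lock:
-            self._plans = {k: v for k, v in self._plans.items()
-                           if table not in v[0]}
-
-    def flush(self):
-        # e.g. a DDL statement or a bare ANALYZE: drop everything.
-        with self._lock:
-            self._plans = {}
-
-cache = PlanCache()
-cache.put("q1", ["parts"], "plan for parts query")
-cache.put("q2", ["orders"], "plan for orders query")
-cache.invalidate_table("parts")
-```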
- -Chris - - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M21276@postgresql.org Sat Apr 13 11:48:51 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3DFmoS27879 - for ; Sat, 13 Apr 2002 11:48:51 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 9EB81475C5C; Sat, 13 Apr 2002 11:46:52 -0400 (EDT) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id 0FE0B474E78 - for ; Sat, 13 Apr 2002 11:46:09 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g3DFk2F15743; - Sat, 13 Apr 2002 11:46:02 -0400 (EDT) -To: "Christopher Kings-Lynne" -cc: "Barry Lind" , "Karel Zak" , - pgsql-hackers@postgresql.org, "Neil Conway" -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <002301c1e2b3$804bd000$0200a8c0@SOL> -References: <3CB52C54.4020507@freaky-namuh.com> <20020411115434.201ff92f.nconway@klamath.dyndns.org> <3CB61DAB.5010601@freaky-namuh.com> <24184.1018581907@sss.pgh.pa.us> <3CB65B49.93F2F790@tpf.co.jp> <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> <20020412095116.B6370@zf.jcu.cz> <27235.1018620866@sss.pgh.pa.us> <3CB70E7C.3090801@xythos.com> <002301c1e2b3$804bd000$0200a8c0@SOL> -Comments: In-reply-to "Christopher Kings-Lynne" - message dated "Sat, 13 Apr 2002 14:21:50 +0800" -Date: Sat, 13 Apr 2002 11:46:01 -0400 -Message-ID: <15740.1018712761@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -"Christopher Kings-Lynne" writes: -> thought out way of predicting/limiting their size. 
(2) How the heck do -> you get rid of obsoleted cached plans, if the things stick around in -> shared memory even after you start a new backend? (3) A shared cache -> requires locking; contention among multiple backends to access that -> shared resource could negate whatever performance benefit you might hope -> to realize from it. - -> I don't understand all these locking problems? - -Searching the cache and inserting/deleting entries in the cache probably -have to be mutually exclusive; concurrent insertions probably won't work -either (at least not without a remarkably intelligent data structure). -Unless the cache hit rate is remarkably high, there are going to be lots -of insertions --- and, at steady state, an equal rate of deletions --- -leading to lots of contention. - -This could possibly be avoided if the cache is not used for all query -plans but only for explicitly PREPAREd plans, so that only explicit -EXECUTEs would need to search it. But that approach also makes a -sizable dent in the usefulness of the cache to begin with. 
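-[Editorial sketch] Tom's steady-state point is easy to put numbers on
-(illustrative arithmetic only): every miss eventually costs an insertion
-plus a matching eviction, both under the cache's exclusive lock.
-
-```python
-def exclusive_ops_per_sec(query_rate, hit_rate):
-    """At steady state each cache miss triggers an insertion and, once
-    the cache is full, an eviction -- two operations that must hold the
-    cache's exclusive lock and so contend with every reader."""
-    misses = query_rate * (1.0 - hit_rate)
-    return 2.0 * misses
-
-# At 10,000 queries/sec, even a 90% hit rate leaves roughly 2,000
-# exclusive-lock operations per second on the shared structure.
-ops = exclusive_ops_per_sec(10000, 0.90)
-```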
- - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M21280@postgresql.org Sat Apr 13 14:36:34 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3DIaYS10293 - for ; Sat, 13 Apr 2002 14:36:34 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id AA151475BB1; Sat, 13 Apr 2002 14:36:17 -0400 (EDT) -Received: from klamath.dyndns.org (CPE002078144ae0.cpe.net.cable.rogers.com [24.102.202.35]) - by postgresql.org (Postfix) with ESMTP id 42993475BCB - for ; Sat, 13 Apr 2002 14:35:42 -0400 (EDT) -Received: from jiro (jiro [192.168.40.7]) - by klamath.dyndns.org (Postfix) with SMTP - id 82B84700C; Sat, 13 Apr 2002 14:35:42 -0400 (EDT) -Date: Sat, 13 Apr 2002 14:35:39 -0400 -From: Neil Conway -To: "Christopher Kings-Lynne" -cc: barry@xythos.com, tgl@sss.pgh.pa.us, zakkr@zf.jcu.cz, - pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] 7.3 schedule -Message-ID: <20020413143539.7818bf7d.nconway@klamath.dyndns.org> -In-Reply-To: <002301c1e2b3$804bd000$0200a8c0@SOL> -References: - <3CB52C54.4020507@freaky-namuh.com> - <20020411115434.201ff92f.nconway@klamath.dyndns.org> - <3CB61DAB.5010601@freaky-namuh.com> - <24184.1018581907@sss.pgh.pa.us> - <3CB65B49.93F2F790@tpf.co.jp> - <20020412004134.5d35a2dd.nconway@klamath.dyndns.org> - <20020412095116.B6370@zf.jcu.cz> - <27235.1018620866@sss.pgh.pa.us> - <3CB70E7C.3090801@xythos.com> - <002301c1e2b3$804bd000$0200a8c0@SOL> -X-Mailer: Sylpheed version 0.7.4 (GTK+ 1.2.10; i386-debian-linux-gnu) -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Sat, 13 Apr 2002 
14:21:50 +0800 -"Christopher Kings-Lynne" wrote: -> Could this cache mechanism be used to make views fast as well? - -The current PREPARE/EXECUTE code will speed up queries that use -rules of any kind, including views: the query plan is cached after -it has been rewritten as necessary, so (AFAIK) this should mean -that rules will be evaluated once when the query is PREPAREd, and -then cached for subsequent EXECUTE commands. - -Cheers, - -Neil - --- -Neil Conway -PGP Key ID: DB3C29FC - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M21309@postgresql.org Sun Apr 14 15:22:44 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3EJMiS24239 - for ; Sun, 14 Apr 2002 15:22:44 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 44BAC475E05; Sun, 14 Apr 2002 15:22:42 -0400 (EDT) -Received: from ara.zf.jcu.cz (ara.zf.jcu.cz [160.217.161.4]) - by postgresql.org (Postfix) with ESMTP id 3CD03475925 - for ; Sun, 14 Apr 2002 15:21:58 -0400 (EDT) -Received: from ara.zf.jcu.cz (LOCALHOST [127.0.0.1]) - by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) with ESMTP id g3EJLiBK012612; - Sun, 14 Apr 2002 21:21:44 +0200 -Received: (from zakkr@localhost) - by ara.zf.jcu.cz (8.12.1/8.12.1/Debian -5) id g3EJLi3k012611; - Sun, 14 Apr 2002 21:21:44 +0200 -Date: Sun, 14 Apr 2002 21:21:44 +0200 -From: Karel Zak -To: Tom Lane -cc: Bruce Momjian , pgsql-hackers@postgresql.org, - Neil Conway -Subject: Re: [HACKERS] 7.3 schedule -Message-ID: <20020414212144.A12196@zf.jcu.cz> -References: <200204121621.g3CGL4310492@candle.pha.pa.us> <27964.1018630286@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: 
Mutt/1.2.5i -In-Reply-To: <27964.1018630286@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Fri, Apr 12, 2002 at 12:51:26PM -0400 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Fri, Apr 12, 2002 at 12:51:26PM -0400, Tom Lane wrote: -> Bruce Momjian writes: -> > Certainly a shared cache would be good for apps that connect to issue a -> > single query frequently. In such cases, there would be no local cache -> > to use. -> -> We have enough other problems with the single-query-per-connection -> scenario that I see no reason to believe that a shared plan cache will -> help materially. The correct answer for those folks will *always* be -> to find a way to reuse the connection. - - My query cache was write for 7.0. If some next release will use - pre-forked backend and after a client disconnection the backend will - still alives and waits for new client the shared cache is (maybe:-) not - needful. The current backend fork model is killer of all possible - caching. - - We have more caches. I hope persistent backend help will help to all - and I'm sure that speed will grow up with persistent backend and - persistent caches without shared memory usage. 
There I can agree with - Tom :-) - - Karel - --- - Karel Zak - http://home.zf.jcu.cz/~zakkr/ - - C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M21321@postgresql.org Sun Apr 14 20:40:08 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g3F0e7S29723 - for ; Sun, 14 Apr 2002 20:40:07 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 3B5FB475DC5; Sun, 14 Apr 2002 20:40:03 -0400 (EDT) -Received: from localhost.localdomain (bgp01077650bgs.wanarb01.mi.comcast.net [68.40.135.112]) - by postgresql.org (Postfix) with ESMTP id 7B1D3474E71 - for ; Sun, 14 Apr 2002 20:39:18 -0400 (EDT) -Received: from localhost (camber@localhost) - by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g3F0cmD10631; - Sun, 14 Apr 2002 20:38:48 -0400 -X-Authentication-Warning: localhost.localdomain: camber owned process doing -bs -Date: Sun, 14 Apr 2002 20:38:48 -0400 (EDT) -From: Brian Bruns -X-X-Sender: -To: Hannu Krosing -cc: -Subject: Re: [HACKERS] 7.3 schedule -In-Reply-To: <1018704763.1784.1.camel@taru.tm.ee> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On 13 Apr 2002, Hannu Krosing wrote: - -> On Fri, 2002-04-12 at 03:04, Brian Bruns wrote: -> > On 11 Apr 2002, Hannu Krosing wrote: -> > -> > > IIRC someone started work on modularising the network-related parts with -> > > a goal of supporting DRDA (DB2 protocol) and others in future. -> > -> > That was me, although I've been bogged down lately, and haven't been able -> > to get back to it. -> -> Has any of your modularisation work got into CVS yet ? 
- -No, Bruce didn't like the way I did certain things, and had some qualms -about the value of supporting multiple wire protocols IIRC. Plus the -patch was not really ready for primetime yet. - -I'm hoping to get back to it soon and sync it with the latest CVS, and -clean up the odds and ends. - -> > DRDA, btw, is not just a DB2 protocol but an opengroup -> > spec that hopefully will someday be *the* standard on the wire database -> > protocol. DRDA handles prepare/execute and is completely binary in -> > representation, among other advantages. -> -> What about extensibility - is there some predefined way of adding new -> types ? - -Not really, there is some ongoing standards activity adding some new -features. The list of supported types is pretty impressive, anything in -particular you are looking for? - -> Also, does it handle NOTIFY ? - -I don't know the answer to this. The spec is pretty huge, so it may, but -I haven't seen it. - -Even if it is supported as a secondary protocol, I believe there is alot -of value in having a single database protocol standard. (why else would I -be doing it!). I'm also looking into what it will take to do the same for -MySQL and Firebird. Hopefully they will be receptive to the idea as well. - -> ---------------- -> Hannu - -Cheers, - -Brian - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - diff --git a/doc/TODO.detail/replication b/doc/TODO.detail/replication deleted file mode 100644 index d4bf4b1fe2..0000000000 --- a/doc/TODO.detail/replication +++ /dev/null @@ -1,6241 +0,0 @@ -From goran@kirra.net Mon Dec 20 14:30:54 1999 -Received: from villa.bildbasen.se (villa.bildbasen.se [193.45.225.97]) - by candle.pha.pa.us (8.9.0/8.9.0) with SMTP id PAA29058 - for ; Mon, 20 Dec 1999 15:30:17 -0500 (EST) -Received: (qmail 2485 invoked from network); 20 Dec 1999 20:29:53 -0000 -Received: from a112.dial.kiruna.se (HELO kirra.net) (193.45.238.12) - by villa.bildbasen.se with SMTP; 20 Dec 1999 20:29:53 -0000 -Sender: goran -Message-ID: <385E9192.226CC37D@kirra.net> -Date: Mon, 20 Dec 1999 21:29:06 +0100 -From: Goran Thyni -Organization: kirra.net -X-Mailer: Mozilla 4.6 [en] (X11; U; Linux 2.2.13 i586) -X-Accept-Language: sv, en -MIME-Version: 1.0 -To: Bruce Momjian -CC: "neil d. quiogue" , - PostgreSQL-development -Subject: Re: [HACKERS] Re: QUESTION: Replication -References: <199912201508.KAA20572@candle.pha.pa.us> -Content-Type: text/plain; charset=iso-8859-1 -Content-Transfer-Encoding: 8bit -Status: OR - -Bruce Momjian wrote: -> We need major work in this area, or at least a plan and an FAQ item. -> We are getting major questions on this, and I don't know enough even to -> make an FAQ item telling people their options. - -My 2 cents, or 2 ören since I'm a Swede, on this: - -It is pretty simple to build a replication with pg_dump, transfer, -empty replic and reload. -But if we want "live replicas" we better base our efforts on a -mechanism using WAL-logs to rollforward the replicas. - -regards, ------------------ -Göran Thyni -On quiet nights you can hear Windows NT reboot! 
- -From owner-pgsql-hackers@hub.org Fri Dec 24 10:01:18 1999 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA11295 - for ; Fri, 24 Dec 1999 11:01:17 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id KAA20310 for ; Fri, 24 Dec 1999 10:39:18 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id KAA61760; - Fri, 24 Dec 1999 10:31:13 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 10:30:48 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id KAA58879 - for pgsql-hackers-outgoing; Fri, 24 Dec 1999 10:29:51 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from bocs170n.black-oak.COM ([38.149.137.131]) - by hub.org (8.9.3/8.9.3) with ESMTP id KAA58795 - for ; Fri, 24 Dec 1999 10:29:00 -0500 (EST) - (envelope-from DWalker@black-oak.com) -From: DWalker@black-oak.com -To: pgsql-hackers@postgreSQL.org -Subject: [HACKERS] database replication -Date: Fri, 24 Dec 1999 10:27:59 -0500 -Message-ID: -X-Priority: 3 (Normal) -X-MIMETrack: Serialize by Router on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99 - 10:28:01 AM -MIME-Version: 1.0 -MIME-Version: 1.0 -Content-Type: text/html; charset=ISO-8859-1 -Content-Transfer-Encoding: quoted-printable -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -

I've been toying with the idea of implementing database replication for
-the last few days.  The system I'm proposing will be a seperate program
-which can be run on any machine and will most likely be implemented in
-Python.  What I'm looking for at this point are gaping holes in my
-thinking/logic/etc.  Here's what I'm thinking...
-
-1) I want to make this program an additional layer over PostgreSQL.  I
-really don't want to hack server code if I can get away with it.  At
-this point I don't feel I need to.
-
-2) The replication system will need to add at least one field to each
-table in each database that needs to be replicated.  This field will be
-a date/time stamp which identifies the "last update" of the record.
-This field will be called PGR_TIME for lack of a better name.  Because
-this field will be used from within programs and triggers it can be
-longer so as to not mistake it for a user field.
-
-3) For each table to be replicated the replication system will
-programatically add one plpgsql function and trigger to modify the
-PGR_TIME field on both UPDATEs and INSERTs.  The name of this function
-and trigger will be along the lines of
-<table_name>_replication_update_trigger and
-<table_name>_replication_update_function.  The function is a simple
-two-line chunk of code to set the field PGR_TIME equal to NOW.  The
-trigger is called before each insert/update.  When looking at the Docs
-I see that times are stored in Zulu (GT) time.  Because of this I don't
-have to worry about time zones and the like.  I need direction on this
-part (such as "hey dummy, look at page N of file X.").
-
-4) At this point we have tables which can, at a basic level, tell the
-replication system when they were last updated.
-
-5) The replication system will have a database of its own to record the
-last replication event, hold configuration, logs, etc.  I'd prefer to
-store the configuration in a PostgreSQL table but it could just as
-easily be stored in a text file on the filesystem somewhere.
-
-6) To handle replication I basically check the local "last replication
-time" and compare it against the remote PGR_TIME fields.  If the remote
-PGR_TIME is greater than the last replication time then change the
-local copy of the database, otherwise, change the remote end of the
-database.  At this point I don't have a way to know WHICH field changed
-between the two replicas so either I do ROW level replication or I
-check each field.  I check PGR_TIME to determine which field is the
-most current.  Some fine tuning of this process will have to occur no
-doubt.
-
-7) The commandline utility, fired off by something like cron, could run
-several times during the day -- command line parameters can be
-implemented to say PUSH ALL CHANGES TO SERVER A, or PULL ALL CHANGES
-FROM SERVER B.
-
-Questions/Concerns:
-
-1) How far do I go with this?  Do I start manhandling the system
-catalogs (pg_* tables)?
-
-2) As to #2 and #3 above, I really don't like tools automagically
-changing my tables but at this point I don't see a way around it.  I
-guess this is where the testing comes into play.
-
-3) Security: the replication app will have to have pretty good rights
-to the database so it can add the nessecary functions and triggers,
-modify table schema, etc.
-
-So, any "you're insane and should run home to momma" comments?
-
-              Damond
-
= - -************ - -From owner-pgsql-hackers@hub.org Fri Dec 24 18:31:03 1999 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id TAA26244 - for ; Fri, 24 Dec 1999 19:31:02 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id TAA12730 for ; Fri, 24 Dec 1999 19:30:05 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id TAA57851; - Fri, 24 Dec 1999 19:23:31 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 19:22:54 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id TAA57710 - for pgsql-hackers-outgoing; Fri, 24 Dec 1999 19:21:56 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from Mail.austin.rr.com (sm2.texas.rr.com [24.93.35.55]) - by hub.org (8.9.3/8.9.3) with ESMTP id TAA57680 - for ; Fri, 24 Dec 1999 19:21:25 -0500 (EST) - (envelope-from ELOEHR@austin.rr.com) -Received: from austin.rr.com ([24.93.40.248]) by Mail.austin.rr.com with Microsoft SMTPSVC(5.5.1877.197.19); - Fri, 24 Dec 1999 18:12:50 -0600 -Message-ID: <38640E2D.75136600@austin.rr.com> -Date: Fri, 24 Dec 1999 18:22:05 -0600 -From: Ed Loehr -X-Mailer: Mozilla 4.7 [en] (X11; U; Linux 2.2.12-20smp i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: DWalker@black-oak.com -CC: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] database replication -References: -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -DWalker@black-oak.com wrote: - -> 6) To handle replication I basically check the local "last -> replication time" and compare it against the remote PGR_TIME -> fields. If the remote PGR_TIME is greater than the last replication -> time then change the local copy of the database, otherwise, change -> the remote end of the database. 
At this point I don't have a way to -> know WHICH field changed between the two replicas so either I do ROW -> level replication or I check each field. I check PGR_TIME to -> determine which field is the most current. Some fine tuning of this -> process will have to occur no doubt. - -Interesting idea. I can see how this might sync up two databases -somehow. For true replication, however, I would always want every -replicated database to be, at the very least, internally consistent -(i.e., referential integrity), even if it was a little behind on -processing transactions. In this method, its not clear how -consistency is every achieved/guaranteed at any point in time if the -input stream of changes is continuous. If the input stream ceased, -then I can see how this approach might eventually catch up and totally -resync everything, but it looks *very* computationally expensive. - -But I might have missed something. How would internal consistency be -maintained? - - -> 7) The commandline utility, fired off by something like cron, could -> run several times during the day -- command line parameters can be -> implemented to say PUSH ALL CHANGES TO SERVER A, or PULL ALL CHANGES -> FROM SERVER B. - -My two cents is that, while I can see this kind of database syncing as -valuable, this is not the kind of "replication" I had in mind. This -may already possible by simply copying the database. What replication -means to me is a live, continuously streaming sequence of updates from -one database to another where the replicated database is always -internally consistent, available for read-only queries, and never "too -far" out of sync with the source/primary database. - -What does replication mean to others? 
- -Cheers, -Ed Loehr - - - -************ - -From owner-pgsql-hackers@hub.org Fri Dec 24 21:31:10 1999 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA02578 - for ; Fri, 24 Dec 1999 22:31:09 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id WAA16641 for ; Fri, 24 Dec 1999 22:18:56 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id WAA89135; - Fri, 24 Dec 1999 22:11:12 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 24 Dec 1999 22:10:56 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id WAA89019 - for pgsql-hackers-outgoing; Fri, 24 Dec 1999 22:09:59 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from bocs170n.black-oak.COM ([38.149.137.131]) - by hub.org (8.9.3/8.9.3) with ESMTP id WAA88957; - Fri, 24 Dec 1999 22:09:11 -0500 (EST) - (envelope-from dwalker@black-oak.com) -Received: from gcx80 ([151.196.99.113]) - by bocs170n.black-oak.COM (Lotus Domino Release 5.0.1) - with SMTP id 1999122422080835:6 ; - Fri, 24 Dec 1999 22:08:08 -0500 -Message-ID: <001b01bf4e9e$647287d0$af63a8c0@walkers.org> -From: "Damond Walker" -To: -Cc: -References: <38640E2D.75136600@austin.rr.com> -Subject: Re: [HACKERS] database replication -Date: Fri, 24 Dec 1999 22:07:55 -0800 -MIME-Version: 1.0 -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook Express 5.00.2314.1300 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -X-MIMETrack: Itemize by SMTP Server on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99 - 10:08:09 PM, - Serialize by Router on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/24/99 - 10:08:11 PM, - Serialize complete at 12/24/99 10:08:11 PM -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; - charset="iso-8859-1" -Sender: 
owner-pgsql-hackers@postgreSQL.org -Status: OR - -> -> Interesting idea. I can see how this might sync up two databases -> somehow. For true replication, however, I would always want every -> replicated database to be, at the very least, internally consistent -> (i.e., referential integrity), even if it was a little behind on -> processing transactions. In this method, its not clear how -> consistency is every achieved/guaranteed at any point in time if the -> input stream of changes is continuous. If the input stream ceased, -> then I can see how this approach might eventually catch up and totally -> resync everything, but it looks *very* computationally expensive. -> - - What's the typical unit of work for the database? Are we talking about -update transactions which span the entire DB? Or are we talking about -updating maybe 1% or less of the database everyday? I'd think it would be -more towards the latter than the former. So, yes, this process would be -computationally expensive but how many records would actually have to be -sent back and forth? - -> But I might have missed something. How would internal consistency be -> maintained? -> - - Updates that occur at site A will be moved to site B and vice versa. -Consistency would be maintained. The only problem that I can see right off -the bat would be what if site A and site B made changes to a row and then -site C was brought into the picture? Which one wins? - - Someone *has* to win when it comes to this type of thing. You really -DON'T want to start merging row changes... - -> -> My two cents is that, while I can see this kind of database syncing as -> valuable, this is not the kind of "replication" I had in mind. This -> may already possible by simply copying the database. 
What replication -> means to me is a live, continuously streaming sequence of updates from -> one database to another where the replicated database is always -> internally consistent, available for read-only queries, and never "too -> far" out of sync with the source/primary database. -> - - Sounds like you're talking about distributed transactions to me. That's -an entirely different subject all-together. What you describe can be done -by copying a database...but as you say, this would only work in a read-only -situation. - - - Damond - - -************ - -From owner-pgsql-hackers@hub.org Sat Dec 25 16:35:07 1999 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA28890 - for ; Sat, 25 Dec 1999 17:35:05 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id RAA86997; - Sat, 25 Dec 1999 17:29:10 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Sat, 25 Dec 1999 17:28:09 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id RAA86863 - for pgsql-hackers-outgoing; Sat, 25 Dec 1999 17:27:11 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from mtiwmhc08.worldnet.att.net (mtiwmhc08.worldnet.att.net [204.127.131.19]) - by hub.org (8.9.3/8.9.3) with ESMTP id RAA86798 - for ; Sat, 25 Dec 1999 17:26:34 -0500 (EST) - (envelope-from pgsql@rkirkpat.net) -Received: from [192.168.3.100] ([12.74.72.219]) - by mtiwmhc08.worldnet.att.net (InterMail v03.02.07.07 118-134) - with ESMTP id <19991225222554.VIOL28505@[12.74.72.219]>; - Sat, 25 Dec 1999 22:25:54 +0000 -Date: Sat, 25 Dec 1999 15:25:47 -0700 (MST) -From: Ryan Kirkpatrick -X-Sender: rkirkpat@excelsior.rkirkpat.net -To: DWalker@black-oak.com -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] database replication -In-Reply-To: -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Sender: 
owner-pgsql-hackers@postgreSQL.org -Status: OR - -On Fri, 24 Dec 1999 DWalker@black-oak.com wrote: - -> I've been toying with the idea of implementing database replication -> for the last few days. - - I too have been thinking about this some over the last year or -two, just trying to find a quick and easy way to do it. I am not so -interested in replication, as in synchronization, as in between a desktop -machine and a laptop, so I can keep the databases on each in sync with -each other. For this sort of purpose, both the local and remote databases -would be "idle" at the time of syncing. - -> 2) The replication system will need to add at least one field to each -> table in each database that needs to be replicated. This field will be -> a date/time stamp which identifies the "last update" of the record. -> This field will be called PGR_TIME for lack of a better name. -> Because this field will be used from within programs and triggers it -> can be longer so as to not mistake it for a user field. - - How about a single, seperate table with the fields of 'database', -'tablename', 'oid', 'last_changed', that would store the same data as your -PGR_TIME field. It would be seperated from the actually data tables, and -therefore would be totally transparent to any database interface -applications. The 'oid' field would hold each row's OID, a nice, unique -identification number for the row, while the other fields would tell which -table and database the oid is in. Then this table can be compared with the -this table on a remote machine to quickly find updates and changes, then -each differences can be dealt with in turn. - -> 3) For each table to be replicated the replication system will -> programatically add one plpgsql function and trigger to modify the -> PGR_TIME field on both UPDATEs and INSERTs. The name of this function -> and trigger will be along the lines of -> _replication_update_trigger and -> _replication_update_function. 
The function is a simple -> two-line chunk of code to set the field PGR_TIME equal to NOW. The -> trigger is called before each insert/update. When looking at the Docs -> I see that times are stored in Zulu (GT) time. Because of this I -> don't have to worry about time zones and the like. I need direction -> on this part (such as "hey dummy, look at page N of file X."). - - I like this idea, better than any I have come up with yet. Though, -how are you going to handle DELETEs? - -> 6) To handle replication I basically check the local "last replication -> time" and compare it against the remote PGR_TIME fields. If the -> remote PGR_TIME is greater than the last replication time then change -> the local copy of the database, otherwise, change the remote end of -> the database. At this point I don't have a way to know WHICH field -> changed between the two replicas so either I do ROW level replication -> or I check each field. I check PGR_TIME to determine which field is -> the most current. Some fine tuning of this process will have to occur -> no doubt. - - Yea, this is indeed the sticky part, and would indeed require some -fine-tunning. Basically, the way I see it, is if the two timestamps for a -single row do not match (or even if the row and therefore timestamp is -missing on one side or the other altogether): - local ts > remote ts => Local row is exported to remote. - remote ts > local ts => Remote row is exported to local. - local ts > last sync time && no remote ts => - Local row is inserted on remote. - local ts < last sync time && no remote ts => - Local row is deleted. - remote ts > last sync time && no local ts => - Remote row is inserted on local. - remote ts < last sync time && no local ts => - Remote row is deleted. -where the synchronization process is running on the local machine. By -exported, I mean the local values are sent to the remote machine, and the -row on that remote machine is updated to the local values. How does this -sound? 
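-
-[Editorial sketch] The timestamp rules above translate almost
-line-for-line into code; here is a hypothetical Python sketch (`None`
-stands for a missing row, and the function is run on the local machine
-for each row identifier):
-
-```python
-def sync_action(local_ts, remote_ts, last_sync):
-    """Decide what to do with one row, given its last-update timestamp
-    on each side and the time of the previous synchronization."""
-    if local_ts is not None and remote_ts is not None:
-        if local_ts > remote_ts:
-            return "export local row to remote"
-        if remote_ts > local_ts:
-            return "export remote row to local"
-        return "in sync"
-    if local_ts is not None:            # row missing on the remote side
-        if local_ts > last_sync:
-            return "insert local row on remote"   # created since last sync
-        return "delete local row"                 # was deleted remotely
-    if remote_ts is not None:           # row missing on the local side
-        if remote_ts > last_sync:
-            return "insert remote row on local"
-        return "delete remote row"
-    return "nothing to do"
-```
-
-Conflict resolution beyond "newest timestamp wins" -- the three-site
-case raised earlier in the thread -- is deliberately out of scope here.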
-
-> 7) The commandline utility, fired off by something like cron, could
-> run several times during the day -- command line parameters can be
-> implemented to say PUSH ALL CHANGES TO SERVER A, or PULL ALL CHANGES
-> FROM SERVER B.
-
-	Or run manually, for my purposes. Also, maybe follow it
-with a vacuum run on both sides for all databases, as this is going to
-potentially cause lots of table changes that could stand a cleanup.
-
-> 1) How far do I go with this?  Do I start manhandling the system catalogs (pg_* tables)?
-
-	Initially, I would just stick to user table data... If you have
-changes in triggers and other meta-data/executable code, you are going to
-want to make syncs of that stuff manually anyway. At least I would want
-to.
-
-> 2) As to #2 and #3 above, I really don't like tools automagically
-> changing my tables but at this point I don't see a way around it.  I
-> guess this is where the testing comes into play.
-
-	Hence the reason for the separate table with just a row's
-identification and last update time. The only modification to the synced
-database is the update trigger, which should be pretty harmless.
-
-> 3) Security: the replication app will have to have pretty good rights
-> to the database so it can add the necessary functions and triggers,
-> modify table schema, etc.
-
-	Just run the sync program as the postgres superuser, and there
-are no problems. :)
-
-> So, any "you're insane and should run home to momma" comments?
-
-	No, not at all. Though it probably should be renamed from
-replication to synchronization. The former is usually associated with a
-continuous stream of updates between the local and remote databases, so
-they are almost always in sync, and have a queuing ability if their
-connection is lost for a span of time as well. Very complex and difficult to
-implement, and would require hacking server code.
:( Something only Sybase
-and Oracle have (as far as I know), and from what I have seen of Sybase's
-replication server support (dated by 5yrs) it was a pain to set up and get
-running correctly.
-	The latter, synchronization, is much more manageable, and can still
-be useful, especially when you have a large database you want in two
-places, mainly for read-only purposes at one end or the other, but don't
-want to waste the time/bandwidth to move and load the entire database each
-time it changes on one end or the other. Same idea as mirroring software
-for FTP sites: just transfer the changes, and nothing more.
-	I also like the idea of using Python. I have been using it
-recently for some database interfaces (to PostgreSQL of course :), and it
-is a very nice language to work with. Some worries about performance of
-the program though, as Python is only an interpreted language, and I have
-yet to really be impressed with the speed of execution of my database
-interfaces.
-	Anyway, it sounds like a good project, and finally one where I
-actually have a clue of what is going on, and the skills to help. So, if
-you are interested in pursuing this project, I would be more than glad to
-help. TTYL.
| -| --- Philippians 1:21 (KJV) | ---------------------------------------------------------------------------- -| Ryan Kirkpatrick | Boulder, Colorado | http://www.rkirkpat.net/ | ---------------------------------------------------------------------------- - - - -************ - -From owner-pgsql-hackers@hub.org Sun Dec 26 08:31:09 1999 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA17976 - for ; Sun, 26 Dec 1999 09:31:07 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id JAA23337 for ; Sun, 26 Dec 1999 09:28:36 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id JAA90738; - Sun, 26 Dec 1999 09:21:58 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Sun, 26 Dec 1999 09:19:19 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id JAA90498 - for pgsql-hackers-outgoing; Sun, 26 Dec 1999 09:18:21 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from bocs170n.black-oak.COM ([38.149.137.131]) - by hub.org (8.9.3/8.9.3) with ESMTP id JAA90452 - for ; Sun, 26 Dec 1999 09:17:54 -0500 (EST) - (envelope-from dwalker@black-oak.com) -Received: from vmware98 ([151.196.99.113]) - by bocs170n.black-oak.COM (Lotus Domino Release 5.0.1) - with SMTP id 1999122609164808:7 ; - Sun, 26 Dec 1999 09:16:48 -0500 -Message-ID: <002201bf4fb3$623f0220$b263a8c0@vmware98.walkers.org> -From: "Damond Walker" -To: "Ryan Kirkpatrick" -Cc: -Subject: Re: [HACKERS] database replication -Date: Sun, 26 Dec 1999 10:10:41 -0500 -MIME-Version: 1.0 -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook Express 4.72.3110.1 -X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3 -X-MIMETrack: Itemize by SMTP Server on notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/26/99 - 09:16:51 AM, - Serialize by Router on 
notes01n/BOCS(Release 5.0.1|July 16, 1999) at 12/26/99 - 09:16:54 AM, - Serialize complete at 12/26/99 09:16:54 AM -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; - charset="iso-8859-1" -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -> -> I too have been thinking about this some over the last year or ->two, just trying to find a quick and easy way to do it. I am not so ->interested in replication, as in synchronization, as in between a desktop ->machine and a laptop, so I can keep the databases on each in sync with ->each other. For this sort of purpose, both the local and remote databases ->would be "idle" at the time of syncing. -> - - I don't think it would matter if the databases are idle or not to be -honest with you. At any single point in time when you replicate I'd figure -that the database would be in a consistent state. So, you should be able to -replicate (or sync) a remote database that is in use. After all, you're -getting a snapshot of the database as it stands at 8:45 PM. At 8:46 PM it -may be totally different...but the next time syncing takes place those -changes would appear in your local copy. - - The one problem you may run into is if the remote host is running a -large batch process. It's very likely that you will get 50% of their -changes when you replicate...but then again, that's why you can schedule the -event to work around such things. - -> How about a single, seperate table with the fields of 'database', ->'tablename', 'oid', 'last_changed', that would store the same data as your ->PGR_TIME field. It would be seperated from the actually data tables, and ->therefore would be totally transparent to any database interface ->applications. The 'oid' field would hold each row's OID, a nice, unique ->identification number for the row, while the other fields would tell which ->table and database the oid is in. 
Then this table can be compared with
->the same table on a remote machine to quickly find updates and changes, and
->each difference can be dealt with in turn.
->
-
-    The problem with OIDs is that they are unique at the local level, but if
-you try to use them between servers you can run into overlap. Also, if a
-database is under heavy use this table could quickly become VERY large. Add
-indexes to this table to help performance and you're taking up even more
-disk space.
-
-    Using the PGR_TIME field with an index will allow us to find rows which
-have changed VERY quickly. All we need to do now is somehow programmatically
-find the primary key for a table, so the person setting up replication (or
-syncing) doesn't have to have an in-depth knowledge of the schema in order to
-set up a syncing schedule.
-
->
-> I like this idea, better than any I have come up with yet. Though,
->how are you going to handle DELETEs?
->
-
-    Oops... how about defining a trigger for this? With deletion I guess we
-would have to move a flag into another table saying we deleted record 'X'
-with this primary key from this table.
-
->
-> Yea, this is indeed the sticky part, and would indeed require some
->fine-tuning. Basically, the way I see it, is if the two timestamps for a
->single row do not match (or even if the row and therefore timestamp is
->missing on one side or the other altogether):
->	local ts > remote ts => Local row is exported to remote.
->	remote ts > local ts => Remote row is exported to local.
->	local ts > last sync time && no remote ts =>
->		Local row is inserted on remote.
->	local ts < last sync time && no remote ts =>
->		Local row is deleted.
->	remote ts > last sync time && no local ts =>
->		Remote row is inserted on local.
->	remote ts < last sync time && no local ts =>
->		Remote row is deleted.
->where the synchronization process is running on the local machine.
By ->exported, I mean the local values are sent to the remote machine, and the ->row on that remote machine is updated to the local values. How does this ->sound? -> - - The replication part will be the most complex...that much is for -certain... - - I've been writing systems in Lotus Notes/Domino for the last year or so -and I've grown quite spoiled with what it can do in regards to replication. -It's not real-time but you have to gear your applications to this type of -thing (it's possible to create documents, fire off email to notify people of -changes and have the email arrive before the replicated documents do). -Replicating large Notes/Domino databases takes quite a while....I don't see -any kind of replication or syncing running in a blink of an eye. - - Having said that, a good algo will have to be written to cut down on -network traffic and to keep database conversations down to a minimum. This -will be appreciated by people with low bandwidth connections I'm sure -(dial-ups, fractional T1's, etc). - -> Or run manually for my purposes. Also, maybe follow it ->with a vacuum run on both sides for all databases, as this is going to ->potenitally cause lots of table changes that could stand with a cleanup. -> - - What would a vacuum do to a system being used by many people? - -> No, not at all. Though it probably should be remaned from ->replication to synchronization. The former is usually associated with a ->continuous stream of updates between the local and remote databases, so ->they are almost always in sync, and have a queuing ability if their ->connection is loss for span of time as well. Very complex and difficult to ->implement, and would require hacking server code. :( Something only Sybase ->and Oracle have (as far as I know), and from what I have seen of Sybase's ->replication server support (dated by 5yrs) it was a pain to setup and get ->running correctly. 
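On the point above about programmatically finding a table's primary key: the information is already in the system catalogs, so the sync tool would not need in-depth schema knowledge from the user. A minimal sketch of the catalog query (the helper name is ours; it assumes a reasonably modern PostgreSQL and a DB-API driver such as psycopg to execute it):

```python
def primary_key_query():
    """Return SQL that lists a table's primary-key column names.

    Bind the table name (as text, e.g. 'primarytest2') to the single
    %s placeholder when executing through a DB-API cursor.
    """
    return (
        "SELECT a.attname "
        "FROM pg_index i "
        "JOIN pg_attribute a "
        "  ON a.attrelid = i.indrelid AND a.attnum = ANY(i.indkey) "
        "WHERE i.indrelid = %s::regclass AND i.indisprimary "
        "ORDER BY a.attnum"
    )
```

Executed with something like `cur.execute(primary_key_query(), ('primarytest2',))`, it returns the key columns in index order, which is what a syncing schedule would need.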
- - It could probably be named either way...but the one thing I really don't -want to do is start hacking server code. The PostgreSQL people have enough -to do without worrying about trying to meld anything I've done to their -server. :) - - Besides, I like the idea of having it operate as a stand-alone product. -The only PostgreSQL feature we would require would be triggers and -plpgsql...what was the earliest version of PostgreSQL that supported -plpgsql? Even then I don't see the triggers being that complex to boot. - -> I also like the idea of using Python. I have been using it ->recently for some database interfaces (to PostgreSQL of course :), and it ->is a very nice language to work with. Some worries about performance of ->the program though, as python is only an interpreted lanuage, and I have ->yet to really be impressed with the speed of execution of my database ->interfaces yet. - - The only thing we'd need for Python is the Python extensions for -PostgreSQL...which in turn requires libpq and that's about it. So, it -should be able to run on any platform supported by Python and libpq. Using -TK for the interface components will require NT people to get additional -software from the 'net. At least it did with older version of Windows -Python. Unix folks should be happy....assuming they have X running on the -machine doing the replication or syncing. Even then I wrote a curses based -Python interface awhile back which allows buttons, progress bars, input -fields, etc (I called it tinter and it's available at -http://iximd.com/~dwalker). It's a simple interface and could probably be -cleaned up a bit but it works. :) - -> Anyway, it sound like a good project, and finally one where I ->actually have a clue of what is going on, and the skills to help. So, if ->you are interested in pursing this project, I would be more than glad to ->help. TTYL. -> - - - That would be a Good Thing. Have webspace somewhere? 
If I can get -permission from the "powers that be" at the office I could host a website on -our (Domino) webserver. - - Damond - - -************ - -From owner-pgsql-hackers@hub.org Sun Dec 26 19:11:48 1999 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA26661 - for ; Sun, 26 Dec 1999 20:11:46 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id UAA14959; - Sun, 26 Dec 1999 20:08:15 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Sun, 26 Dec 1999 20:07:27 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id UAA14820 - for pgsql-hackers-outgoing; Sun, 26 Dec 1999 20:06:28 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from mtiwmhc02.worldnet.att.net (mtiwmhc02.worldnet.att.net [204.127.131.37]) - by hub.org (8.9.3/8.9.3) with ESMTP id UAA14749 - for ; Sun, 26 Dec 1999 20:05:39 -0500 (EST) - (envelope-from rkirkpat@rkirkpat.net) -Received: from [192.168.3.100] ([12.74.72.56]) - by mtiwmhc02.worldnet.att.net (InterMail v03.02.07.07 118-134) - with ESMTP id <19991227010506.WJVW1914@[12.74.72.56]>; - Mon, 27 Dec 1999 01:05:06 +0000 -Date: Sun, 26 Dec 1999 18:05:02 -0700 (MST) -From: Ryan Kirkpatrick -X-Sender: rkirkpat@excelsior.rkirkpat.net -To: Damond Walker -cc: pgsql-hackers@postgreSQL.org -Subject: Re: [HACKERS] database replication -In-Reply-To: <002201bf4fb3$623f0220$b263a8c0@vmware98.walkers.org> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -On Sun, 26 Dec 1999, Damond Walker wrote: - -> > How about a single, seperate table with the fields of 'database', -> >'tablename', 'oid', 'last_changed', that would store the same data as your -> >PGR_TIME field. It would be seperated from the actually data tables, and -... 
-> The problem with OID's is that they are unique at the local level but if -> you try and use them between servers you can run into overlap. - - Yea, forgot about that point, but became dead obvious once you -mentioned it. Boy, I feel stupid now. :) - -> Using the PGR_TIME field with an index will allow us to find rows which -> have changed VERY quickly. All we need to do now is somehow programatically -> find the primary key for a table so the person setting up replication (or -> syncing) doesn't have to have an indepth knowledge of the schema in order to -> setup a syncing schedule. - - Hmm... Yea, maybe look to see which field(s) has a primary, unique -index on it? Then use those field(s) as a primary key. Just require that -any table to be synchronized to have some set of fields that uniquely -identify each row. Either that, or add another field to each table with -our own, cross system consistent, identification system. Don't know which -would be more efficient and easier to work with. - The former could potentially get sticky if it takes a lots of -fields to generate a unique key value, but has the smallest effect on the -table to be synced. The latter could be difficult to keep straight between -systems (local vs. remote), and would require a trigger on inserts to -generate a new, unique id number, that does not exist locally or -remotely (nasty issue there), but would remove the uniqueness -requirement. - -> Oops...how about defining a trigger for this? With deletion I guess we -> would have to move a flag into another table saying we deleted record 'X' -> with this primary key from this table. - - Or, according to my logic below, if a row is missing on one side -or the other, then just compare the remaining row's timestamp to the last -synchronization time (stored in a seperate table/db elsewhere). 
The -results of the comparsion and the state of row existences tell one if the -row was inserted or deleted since the last sync, and what should be done -to perform the sync. - -> > Yea, this is indeed the sticky part, and would indeed require some -> >fine-tunning. Basically, the way I see it, is if the two timestamps for a -> >single row do not match (or even if the row and therefore timestamp is -> >missing on one side or the other altogether): -> > local ts > remote ts => Local row is exported to remote. -> > remote ts > local ts => Remote row is exported to local. -> > local ts > last sync time && no remote ts => -> > Local row is inserted on remote. -> > local ts < last sync time && no remote ts => -> > Local row is deleted. -> > remote ts > last sync time && no local ts => -> > Remote row is inserted on local. -> > remote ts < last sync time && no local ts => -> > Remote row is deleted. -> >where the synchronization process is running on the local machine. By -> >exported, I mean the local values are sent to the remote machine, and the -> >row on that remote machine is updated to the local values. How does this -> >sound? - -> Having said that, a good algo will have to be written to cut down on -> network traffic and to keep database conversations down to a minimum. This -> will be appreciated by people with low bandwidth connections I'm sure -> (dial-ups, fractional T1's, etc). - - Of course! In reflection, the assigned identification number I -mentioned above might be the best then, instead of having to transfer the -entire set of key fields back and forth. - -> What would a vacuum do to a system being used by many people? - - Probably lock them out of tables while they are vacuumed... Maybe -not really required in the end, possibly optional? - -> It could probably be named either way...but the one thing I really don't -> want to do is start hacking server code. 
The PostgreSQL people have enough -> to do without worrying about trying to meld anything I've done to their -> server. :) - - Yea, they probably would appreciate that. They already have enough -on thier plate for 7.x as it is! :) - -> Besides, I like the idea of having it operate as a stand-alone product. -> The only PostgreSQL feature we would require would be triggers and -> plpgsql...what was the earliest version of PostgreSQL that supported -> plpgsql? Even then I don't see the triggers being that complex to boot. - - No, provided that we don't do the identification number idea -(which the more I think about it, probably will not work). As for what -version support plpgsql, I don't know, one of the more hard-core pgsql -hackers can probably tell us that. - -> The only thing we'd need for Python is the Python extensions for -> PostgreSQL...which in turn requires libpq and that's about it. So, it -> should be able to run on any platform supported by Python and libpq. - - Of course. If it ran on NT as well as Linux/Unix, that would be -even better. :) - -> Unix folks should be happy....assuming they have X running on the -> machine doing the replication or syncing. Even then I wrote a curses -> based Python interface awhile back which allows buttons, progress -> bars, input fields, etc (I called it tinter and it's available at -> http://iximd.com/~dwalker). It's a simple interface and could -> probably be cleaned up a bit but it works. :) - - Why would we want any type of GUI (X11 or curses) for this sync -program. I imagine just a command line program with a few options (local -machine, remote machine, db name, etc...), and nothing else. - Though I will take a look at your curses interface, as I have been -wanting to make a curses interface to a few db interfaces I have, in a -simple as manner as possible. - -> That would be a Good Thing. Have webspace somewhere? 
If I can get -> permission from the "powers that be" at the office I could host a website on -> our (Domino) webserver. - - Yea, I got my own web server (www.rkirkpat.net) with 1GB+ of disk -space available, sitting on a decent speed DSL. Even can setup of a -virtual server if we want (i.e. pgsync.rkirkpat.net :). CVS repository, -email lists, etc... possible with some effort (and time). - So, where should we start? TTYL. - - PS. The current pages on my web site are very out of date at the -moment (save for the pgsql information). I hope to have updated ones up -within the week. - ---------------------------------------------------------------------------- -| "For to me to live is Christ, and to die is gain." | -| --- Philippians 1:21 (KJV) | ---------------------------------------------------------------------------- -| Ryan Kirkpatrick | Boulder, Colorado | http://www.rkirkpat.net/ | ---------------------------------------------------------------------------- - - -************ - -From owner-pgsql-hackers@hub.org Mon Dec 27 12:33:32 1999 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA24817 - for ; Mon, 27 Dec 1999 13:33:29 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id NAA53391; - Mon, 27 Dec 1999 13:29:02 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Mon, 27 Dec 1999 13:28:38 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id NAA53248 - for pgsql-hackers-outgoing; Mon, 27 Dec 1999 13:27:40 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from gtv.ca (h139-142-238-17.cg.fiberone.net [139.142.238.17]) - by hub.org (8.9.3/8.9.3) with ESMTP id NAA53170 - for ; Mon, 27 Dec 1999 13:26:40 -0500 (EST) - (envelope-from aaron@genisys.ca) -Received: from stilborne (24.67.90.252.ab.wave.home.com [24.67.90.252]) - by gtv.ca (8.9.3/8.8.7) with SMTP id MAA01200 - for ; Mon, 27 
Dec 1999 12:36:39 -0700 -From: "Aaron J. Seigo" -To: pgsql-hackers@hub.org -Subject: Re: [HACKERS] database replication -Date: Mon, 27 Dec 1999 11:23:19 -0700 -X-Mailer: KMail [version 1.0.28] -Content-Type: text/plain -References: <199912271135.TAA10184@netrinsics.com> -In-Reply-To: <199912271135.TAA10184@netrinsics.com> -MIME-Version: 1.0 -Message-Id: <99122711245600.07929@stilborne> -Content-Transfer-Encoding: 8bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - -hi.. - -> Before anyone starts implementing any database replication, I'd strongly -> suggest doing some research, first: -> -> http://sybooks.sybase.com:80/onlinebooks/group-rs/rsg1150e/rs_admin/@Generic__BookView;cs=default;ts=default - -good idea, but perhaps sybase isn't the best study case.. here's some extremely -detailed online coverage of Oracle 8i's replication, from the oracle online -library: - -http://bach.towson.edu/oracledocs/DOC/server803/A54651_01/toc.htm - --- -Aaron J. Seigo -Sys Admin - -************ - -From owner-pgsql-hackers@hub.org Thu Dec 30 08:01:09 1999 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA10317 - for ; Thu, 30 Dec 1999 09:01:08 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id IAA02365 for ; Thu, 30 Dec 1999 08:37:10 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id IAA87902; - Thu, 30 Dec 1999 08:34:22 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Thu, 30 Dec 1999 08:32:24 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id IAA85771 - for pgsql-hackers-outgoing; Thu, 30 Dec 1999 08:31:27 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sandman.acadiau.ca (dcurrie@sandman.acadiau.ca [131.162.129.111]) - by hub.org (8.9.3/8.9.3) with ESMTP id IAA85234 - for ; Thu, 30 Dec 1999 
08:31:10 -0500 (EST)
-	(envelope-from dcurrie@sandman.acadiau.ca)
-Received: (from dcurrie@localhost)
-	by sandman.acadiau.ca (8.8.8/8.8.8/Debian/GNU) id GAA18698;
-	Thu, 30 Dec 1999 06:30:58 -0400
-From: Duane Currie
-Message-Id: <199912301030.GAA18698@sandman.acadiau.ca>
-Subject: Re: [HACKERS] database replication
-In-Reply-To: from "DWalker@black-oak.com" at "Dec 24, 99 10:27:59 am"
-To: DWalker@black-oak.com
-Date: Thu, 30 Dec 1999 10:30:58 +0000 (AST)
-Cc: pgsql-hackers@postgresql.org
-X-Mailer: ELM [version 2.4ME+ PL39 (25)]
-MIME-Version: 1.0
-Content-Type: text/plain; charset=US-ASCII
-Content-Transfer-Encoding: 7bit
-Sender: owner-pgsql-hackers@postgresql.org
-Status: OR
-
-Hi Guys,
-
-Now for one of my REALLY rare posts.
-Having done a little bit of distributed data systems, I figured I'd
-pitch in a couple cents worth.
-
-> 2) The replication system will need to add at least one field to each
-> table in each database that needs to be replicated.  This
-> field will be a date/time stamp which identifies the "last
-> update" of the record.  This field will be called PGR_TIME
-> for lack of a better name.  Because this field will be used
-> from within programs and triggers it can be longer so as to not
-> mistake it for a user field.
-
-I just started reading this thread, but I figured I'd throw in a couple
-suggestions for distributed data control (a few idioms I've had to
-deal with before):
-
- - Never use time (not reliable from system to system).  Use
-   a version number of some sort that can stay consistent across
-   all replicas.
-
-   This way, if a system's time is or goes out of whack, it doesn't
-   cause your database to disintegrate, and it's easier to track
-   conflicts (see below; if using time, the algorithm gets
-   nightmarish).
-
- - On an insert, set to version 1.
-
- - On an update, version++.
-
- - On a delete, mark deleted, and add a delete stub somewhere for the
-   replicator process to deal with in sync'ing the databases.
- - - If two records have the same version but different data, there's - a conflict. A few choices: - 1. Pick one as the correct one (yuck!! invisible data loss) - 2. Store both copies, pick one as current, and alert - database owner of the conflict, so they can deal with - it "manually." - 3. If possible, some conflicts can be merged. If a disjoint - set of fields were changed in each instance, these changes - may both be applied and the record merged. (Problem: - takes a lot more space. Requires a version number for - every field, or persistent storage of some old records. - However, this might help the "which fields changed" issue - you were talking about in #6) - - - A unique id across all systems should exist (or something that - effectively simulates a unique id. Maybe a composition of the - originating oid (from the insert) and the originating database - (oid of the database's record?) might do it. Store this as - an extra field in every record. - - (Two extra fieldss so far: 'unique id' and 'version') - -I do like your approach: triggers and a separate process. (Maintainable!! 
:) - -Anyway, just figured I'd throw in a few suggestions, -Duane - -************ - -From owner-pgsql-patches@hub.org Sun Jan 2 23:01:38 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA16274 - for ; Mon, 3 Jan 2000 00:01:28 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id XAA02655 for ; Sun, 2 Jan 2000 23:45:55 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) - by hub.org (8.9.3/8.9.3) with ESMTP id XAA13828; - Sun, 2 Jan 2000 23:40:47 -0500 (EST) - (envelope-from owner-pgsql-patches@hub.org) -Received: by hub.org (TLB v0.10a (1.23 tibbs 1997/01/09 00:29:32)); Sun, 02 Jan 2000 23:38:34 +0000 (EST) -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id XAA13624 - for pgsql-patches-outgoing; Sun, 2 Jan 2000 23:37:36 -0500 (EST) - (envelope-from owner-pgsql-patches@postgreSQL.org) -Received: from falla.videotron.net (falla.videotron.net [205.151.222.106]) - by hub.org (8.9.3/8.9.3) with ESMTP id XAA13560 - for ; Sun, 2 Jan 2000 23:37:02 -0500 (EST) - (envelope-from P.Marchesso@Videotron.ca) -Received: from Videotron.ca ([207.253.210.234]) - by falla.videotron.net (Sun Internet Mail Server sims.3.5.1999.07.30.00.05.p8) - with ESMTP id <0FNQ000TEST8VI@falla.videotron.net> for pgsql-patches@postgresql.org; Sun, - 2 Jan 2000 23:37:01 -0500 (EST) -Date: Sun, 02 Jan 2000 23:39:23 -0500 -From: Philippe Marchesseault -Subject: [PATCHES] Distributed PostgreSQL! -To: pgsql-patches@postgreSQL.org -Message-id: <387027FB.EB88D757@Videotron.ca> -MIME-version: 1.0 -X-Mailer: Mozilla 4.51 [en] (X11; I; Linux 2.2.11 i586) -Content-type: MULTIPART/MIXED; BOUNDARY="Boundary_(ID_GeYGc69fE1/bkYLTPwOGFg)" -X-Accept-Language: en -Sender: owner-pgsql-patches@postgreSQL.org -Precedence: bulk -Status: ORr - -This is a multi-part message in MIME format. 
-
----Boundary_(ID_GeYGc69fE1/bkYLTPwOGFg)
-Content-type: text/plain; charset=us-ascii
-Content-transfer-encoding: 7bit
-
-Hi all!
-
-Here is a small patch to make postgres a distributed database. By
-distributed I mean that you can have the same copy of the database on N
-different machines and keep them all in sync.
-It does not improve performance unless you distribute your clients in a
-sensible manner. It does not allow you to do parallel selects.
-
-The support page is: pages.infinit.net/daemon (soon to be in
-English).
-
-The patch was tested with RedHat Linux 6.0 on Intel with kernel 2.2.11.
-Only two machines were used, so I'm not completely sure that it works
-with more than two.
-But it should.
-
-I would like to know if somebody else is interested in this; otherwise
-I'm probably not going to keep it growing. So please reply to my e-mail
-(P.Marchesso@videotron.ca) to give me an idea of the number of people
-interested in this.
-
-Thanks all.
-
-Philippe Marchesseault
-
---
-It's not the size of the dog in the fight,
-but the size of the fight in the dog.
-
-Archie Griffen
-
-
---Boundary_(ID_GeYGc69fE1/bkYLTPwOGFg)
-Content-type: application/octet-stream; name=replicator-0.1.tgz
-Content-disposition: attachment; filename=replicator-0.1.tgz
-Content-transfer-encoding: base64
-
-[base64-encoded attachment replicator-0.1.tgz omitted]
-kYhSQTRX3GOr6qnjUBJr3J1GyhQeHBbqadwjUd48Y2ZLdYtke3WLTWfmb1295vHzSz9R/afh -qZhzJyf//D5+4v6H588P9v3+h8NXuAvw4PDF4eP9f7/KZ+P+h6+vhkd/SY0WNi+AiG9siC6A -D+pP7wWFqvy7dwhFbiiCYWLXQSThOohu+8YIYe4He2l6NUNPUYG5ZzKi3mL8P+o8Vqba9Gnm -4aJiWuRzaeBDA4ahwed7itHMs8XNXagOrY9UdYB3qrEXCrpDqBmCGkH9EhyVuup0rYvm0kNZ -kygZn0/opWahloNDmsF5P51dnadeJ51yt5ce57eolwuO3ZmuRtedXnoUKvV+EJGJHnvJ4Z5d -gqOS1ufF5bHQZpSu3NEHnn3cT/+4uSCd5HNp7Tw3ARwe6KVPj3mBmGgTLJxSsjg664QTGbaX -vJAXj5pyPPPxf91dZz3caG0phaJVBS+Dgrn4p5WxfvpJhO3QK3QYWLteip6117NyWvLeUPP5 -6jUfgzT5FV/g5Vv+gUoizONLIamXXwjl7IO5vHjR83/2+dSikiOfvOS01d8Q7q1SRO+qQXU+ -KwDoBaY8X3gd5730dpKNheJeSRNUjTs0LqT3jtosupSRGlQnX+BZSw1O14C6pKNZKJbaVE7d -CwlJ/NIFuJORQjIKRHFd7fKbUpIvMb2lftlZzjsGqpZzts+rzDDrCZMQUXn70nZX94couspf -RlVQf113revbxvLwSxTJlzZ21H7eWfn5QlSSP4sunh6mz/f3fxVW//h5/Dx+Hj+Pn8fP4+fx -8/h5/Dx+Hj+Pn8fP/9fP/wKykq3cAMgAAA== - ---Boundary_(ID_GeYGc69fE1/bkYLTPwOGFg)-- - -************ - - -From owner-pgsql-hackers@hub.org Mon Jan 3 13:47:07 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA23987 - for ; Mon, 3 Jan 2000 14:47:06 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id OAA03234; - Mon, 3 Jan 2000 14:39:56 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Mon, 3 Jan 2000 14:39:49 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id OAA03050 - for pgsql-hackers-outgoing; Mon, 3 Jan 2000 14:38:50 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from ara.zf.jcu.cz (zakkr@ara.zf.jcu.cz [160.217.161.4]) - by hub.org (8.9.3/8.9.3) with ESMTP id OAA02975 - for ; Mon, 3 Jan 2000 14:38:05 -0500 (EST) - (envelope-from zakkr@zf.jcu.cz) -Received: from localhost (zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian/GNU) with SMTP id UAA19297; - Mon, 3 Jan 2000 20:23:35 +0100 -Date: Mon, 3 Jan 2000 20:23:35 +0100 (CET) -From: Karel Zak - Zakkr -To: 
P.Marchesso@videotron.ca
-cc: pgsql-hackers
-Subject: [HACKERS] replicator
-Message-ID: 
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Sender: owner-pgsql-hackers@postgresql.org
-Status: OR
-
-
-Hi,
-
-I looked at your (Philippe's) replicator, but I don't fully understand
-your replication concept.
-
-
- node1: SQL --IPC--> node-broker
-                         |
-                       TCP/IP
-                         |
-         master-node --IPC--> replicator
-           |    |    |
-              libpq
-           |    |    |
-         node2     node..n
-
-(Is this picture right?)
-
-If I understand correctly, all nodes connect to the master node, and the
-"replicator" on this master node replicates the data. But the master node
-is a very critical spot in this concept - if the master node goes down,
-replication for *all* nodes is lost. Hmm.. but I want to use replication
-for highly available applications...
-
-IMHO there is also a problem with node registration / authentication on
-the master node. Why not make the concept more straightforward? As:
-
- SQL --IPC--> node-replicator
-                 |  |  |
-          via libpq send data to all nodes with the
-          current client/backend auth.
-
- (no master node exists; all nodes have connections to all nodes)
-
-
-Using the replicator as an external process and copying data from SQL to
-this replicator via IPC is (your) very good idea.
- - Karel - - ----------------------------------------------------------------------- -Karel Zak http://home.zf.jcu.cz/~zakkr/ - -Docs: http://docs.linux.cz (big docs archive) -Kim Project: http://home.zf.jcu.cz/~zakkr/kim/ (process manager) -FTP: ftp://ftp2.zf.jcu.cz/users/zakkr/ (C/ncurses/PgSQL) ------------------------------------------------------------------------ - - -************ - -From owner-pgsql-hackers@hub.org Tue Jan 4 10:31:01 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA17522 - for ; Tue, 4 Jan 2000 11:31:00 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.10 $) with ESMTP id LAA01541 for ; Tue, 4 Jan 2000 11:27:30 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id LAA09992; - Tue, 4 Jan 2000 11:18:07 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 4 Jan 2000 11:17:58 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id LAA09856 - for pgsql-hackers-outgoing; Tue, 4 Jan 2000 11:17:17 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from ara.zf.jcu.cz (zakkr@ara.zf.jcu.cz [160.217.161.4]) - by hub.org (8.9.3/8.9.3) with ESMTP id LAA09763 - for ; Tue, 4 Jan 2000 11:16:43 -0500 (EST) - (envelope-from zakkr@zf.jcu.cz) -Received: from localhost (zakkr@localhost) - by ara.zf.jcu.cz (8.9.3/8.9.3/Debian/GNU) with SMTP id RAA31673; - Tue, 4 Jan 2000 17:02:06 +0100 -Date: Tue, 4 Jan 2000 17:02:06 +0100 (CET) -From: Karel Zak - Zakkr -To: Philippe Marchesseault -cc: pgsql-hackers -Subject: Re: [HACKERS] replicator -In-Reply-To: <38714B6F.2DECAEC0@Videotron.ca> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Sender: owner-pgsql-hackers@postgreSQL.org -Status: OR - - -On Mon, 3 Jan 2000, Philippe Marchesseault wrote: - -> So it could become: -> -> SQL --IPC--> 
node-replicator
->          |  |  |
->     via TCP send statements to each node
->     replicator (on local node)
->          |
->     via libpq send data to
->     current (local) backend.
->
-> > (not exist any master node, all nodes have connection to all nodes)
->
-> Exactly, if the replicator dies only the node dies, everything else keeps
-> working.
-
-
- Hi,
-
- I have explored the replication concepts of Oracle and Sybase a little
-(in their manuals). (Does anyone know some interesting links or
-publications about this?)
-
- Firstly, I am sure it is premature to write replication for PgSQL now
-if we don't have an exact concept for it. It needs more suggestions from
-more developers. We first need answers to the following questions:
-
- 1/ Which replication concept should be chosen for PG?
- 2/ How to manage transactions across the nodes? (and we need to define
-    a replication protocol for this)
- 3/ How to integrate replication into the current PG transaction code?
-
-My idea (dream :-) is replication that allows full read-write access on
-all nodes, using the current transaction method in PG - so that there is
-no difference between more backends on one host and more backends on more
-hosts - it gives "global transaction consistency".
-
-Transactions are now managed via IPC (one host); my dream is to manage
-transactions the same way, but between more hosts via TCP. (And to
-optimize this - transfer committed data/commands only.)
-
-Any suggestions?
-
-
--------------------
-Note:
-
-(transaction-oriented replication)
-
- Sybase - I. model (only one node is read-write)
-
-       primary SQL data  (READ-WRITE)
-              |
-       replication agent  (transaction log monitoring)
-              |
-       primary distribution server  (one or more repl. servers)
-         |      /  |  \
-         |     nodes  (READ-ONLY)
-         |
-       secondary dist. server
-          /  |  \
-        nodes  (READ-ONLY)
-
-
- If the primary SQL node is read-write and the other nodes are
- *read-only*, the system keeps working even when a connection is down
- (data are saved to the replication log, and when the connection becomes
- available again the log is written to the node).
-
-
- Sybase - II. 
model (all nodes read-write)
-
-    SQL data 1 --->--+                            NODE I.
-        |            |
-        ^            |
-        |    replication agent 1  (transaction log monitoring)
-        V            |
-        |            V
-        |            |
-    replication server 1
-        |
-        ^
-        V
-        |
-    replication server 2                          NODE II.
-        |                 |
-        ^                 +-<-->--- SQL data 2
-        |                 |
-    replication agent 2 -<--
-
-
-
-Sorry, I'm not sure I redrew the previous picture entirely correctly.
-
-  Karel
-
-
-
-
-
-
-************
-
-From pgsql-hackers-owner+M3133@hub.org Fri Jun 9 15:02:25 2000
-Received: from hub.org (root@hub.org [216.126.84.1])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA22319
-	for ; Fri, 9 Jun 2000 15:02:24 -0400 (EDT)
-Received: from hub.org (majordom@localhost [127.0.0.1])
-	by hub.org (8.10.1/8.10.1) with SMTP id e59IsET81137;
-	Fri, 9 Jun 2000 14:54:14 -0400 (EDT)
-Received: from ultra2.quiknet.com (ultra2.quiknet.com [207.183.249.4])
-	by hub.org (8.10.1/8.10.1) with SMTP id e59IrQT80458
-	for ; Fri, 9 Jun 2000 14:53:26 -0400 (EDT)
-Received: (qmail 13302 invoked from network); 9 Jun 2000 18:53:21 -0000
-Received: from 18.67.tc1.oro.pmpool.quiknet.com (HELO quiknet.com) (pecondon@207.231.67.18)
-	by ultra2.quiknet.com with SMTP; 9 Jun 2000 18:53:21 -0000
-Message-ID: <39413D08.A6BDC664@quiknet.com>
-Date: Fri, 09 Jun 2000 11:52:57 -0700
-From: Paul Condon
-X-Mailer: Mozilla 4.73 [en] (X11; U; Linux 2.2.14-5.0 i686)
-X-Accept-Language: en
-MIME-Version: 1.0
-To: ohp@pyrenet.fr, pgsql-hackers@postgresql.org
-Subject: [HACKERS] Re: Big project, please help
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-
-Two-way replication on a single "table" is available in Lotus Notes. In
-Notes, every record has a time-stamp, which contains the time of the
-last update. (It also has a creation timestamp.) During replication,
-timestamps are compared at the row/record level, and compared with the
-timestamp of the last replication. 
If, for corresponding rows in two
-replicas, the timestamp of one row is newer than the last replication,
-the contents of this newer row are copied to the other replica. But if
-both of the corresponding rows have newer timestamps, there is a
-problem. The Lotus Notes solution is to:
- 1. send a replication conflict message to the Notes Administrator, a
-message that contains full copies of both rows.
- 2. copy the newer row over the older row in the replicas.
- 3. provide a mechanism for the Administrator to reverse the default
-decision in 2, if the semantics of the message history or off-line
-investigation indicates that the wrong decision was made.
-
-In practice, the Administrator is not overwhelmed with replication
-conflict messages because updates usually only originate at the site
-that originally created the row, or updates fill only fields that were
-originally 'TBD'. The full logic is perhaps more complicated than I have
-described here, but it is already complicated enough to give you an idea
-of what you're really being asked to do. I am not aware of a supplier of
-relational databases that really supports two-way replication at the
-level that Notes supports it, but Notes isn't a relational database.
-
-The difficulty of the position that you appear to be in is that
-management might believe that the full problem is solved in brand X
-RDBMS, and you will have trouble convincing management that this is not
-really true. 
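The timestamp rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from Notes or from this thread; the names (`Row`, `replicate`, `last_sync`) are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Row:
    key: int
    data: str
    updated: float  # timestamp of the last update to this replica's copy

def replicate(a: Row, b: Row, last_sync: float):
    """Resolve one pair of corresponding rows.

    Returns (winning row, conflict flag).  A row is 'newer' when it was
    updated after the last replication; if both are newer, the newest
    copy wins but the conflict is flagged for the Administrator.
    """
    a_new = a.updated > last_sync
    b_new = b.updated > last_sync
    if a_new and b_new:
        winner = a if a.updated >= b.updated else b
        return winner, True          # default decision 2, plus conflict report
    if a_new:
        return a, False              # only replica A changed: copy it to B
    if b_new:
        return b, False              # only replica B changed: copy it to A
    return a, False                  # neither changed since last sync
```

A conflict is only reported when both sites touched the row between syncs, which matches the observation that administrators are rarely overwhelmed in practice.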
-
-
-From pgsql-hackers-owner+M2401@hub.org Tue May 23 12:19:54 2000
-Received: from news.tht.net (news.hub.org [216.126.91.242])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id MAA28410
-	for ; Tue, 23 May 2000 12:19:53 -0400 (EDT)
-Received: from hub.org (majordom@hub.org [216.126.84.1])
-	by news.tht.net (8.9.3/8.9.3) with ESMTP id MAB53304;
-	Tue, 23 May 2000 12:00:08 -0400 (EDT)
-	(envelope-from pgsql-hackers-owner+M2401@hub.org)
-Received: from gwineta.repas.de (gwineta.repas.de [193.101.49.1])
-	by hub.org (8.9.3/8.9.3) with ESMTP id LAA39896
-	for ; Tue, 23 May 2000 11:57:31 -0400 (EDT)
-	(envelope-from kardos@repas-aeg.de)
-Received: (from smap@localhost)
-	by gwineta.repas.de (8.8.8/8.8.8) id RAA27154
-	for ; Tue, 23 May 2000 17:57:23 +0200
-Received: from dragon.dr.repas.de(172.30.48.206) by gwineta.repas.de via smap (V2.1)
-	id xma027101; Tue, 23 May 00 17:56:20 +0200
-Received: from kardos.dr.repas.de ([172.30.48.153])
-	by dragon.dr.repas.de (UCX V4.2-21C, OpenVMS V6.2 Alpha);
-	Tue, 23 May 2000 17:57:24 +0200
-Message-ID: <010201bfc4cf$7334d5a0$99301eac@Dr.repas.de>
-From: "Kardos, Dr. Andreas"
-To: "Todd M. Shrider" ,
-	
-References: 
-Subject: Re: [HACKERS] failing over with postgresql
-Date: Tue, 23 May 2000 17:56:20 +0200
-Organization: repas AEG Automation GmbH
-MIME-Version: 1.0
-Content-Type: text/plain;
-	charset="iso-8859-1"
-Content-Transfer-Encoding: 8bit
-X-Priority: 3
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook Express 5.00.2314.1300
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-X-Mailing-List: pgsql-hackers@postgresql.org
-Precedence: bulk
-Sender: pgsql-hackers-owner@hub.org
-Status: OR
-
-For a SCADA system (Supervisory Control and Data Acquisition) that
-consists of one master and one hot-standby server, I have implemented such
-a solution. To these UNIX servers client workstations are connected (NT
-and/or UNIX). The database client programs run on both the client and the
-server side.
-
-When developing this approach I had two goals in mind:
-1) Not to get dependent on the PostgreSQL sources, since they change very
-dynamically.
-2) Not to get dependent on the fe/be protocol, since there are discussions
-about changing it.
-
-So the approach is quite simple: forward all database requests to the
-standby server at the TCP/IP level.
-
-On both servers the postmaster listens on port 5433 and not on 5432. On
-standard port 5432 my program listens instead. This program forks twice for
-every incoming connection. The first instance forwards all packets from the
-frontend to both backends. The second instance receives the packets from all
-backends and forwards the packets from the master backend to the frontend.
-So a frontend running on a server machine connects to port 5432 of
-localhost.
-
-On the client machine runs another program (on NT as a service). This
-program also forks twice for every incoming connection. The first instance
-forwards all packets to port 5432 of the current master server and the
-second instance forwards the packets from the master server to the frontend.
-
-During standby computer startup the database of the master computer is
-dumped, zipped, copied to the standby computer, unzipped and loaded into
-that database.
-If a standby startup took place, all client connections are aborted to allow
-a login into the standby database. The frontends need to reconnect in this
-case. So the database of the standby computer is always in sync.
-
-The disadvantage of this method is that a query cannot be canceled in the
-standby server, since the request key of this connection gets lost. But we
-can live with that.
-
-Both programs are able to run on Unix and on (native!) NT. On NT, threads
-are created instead of forked processes.
-
-This approach is simple, but it is effective and it works.
-
-We hope to survive this way until real replication is implemented in
-PostgreSQL.
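The forwarding rule above (send every frontend packet to both backends, answer the frontend only with the master's reply) can be modeled without sockets. A toy sketch under invented names: `Backend` and `Proxy` are stand-ins, and the SET/GET statement language is made up; the real programs forward raw fe/be protocol packets over TCP.

```python
class Backend:
    """Stand-in for one postmaster: applies statements to a key/value store."""
    def __init__(self):
        self.store = {}

    def execute(self, stmt: str) -> str:
        op, *args = stmt.split()
        if op == "SET":
            self.store[args[0]] = args[1]
            return "OK"
        if op == "GET":
            return self.store.get(args[0], "NULL")
        return "ERROR"

class Proxy:
    """Sits where the frontends expect the postmaster.  Every request goes
    to both backends, so the standby replays the exact same statement
    stream and stays in sync; only the master's reply reaches the frontend."""
    def __init__(self, master: Backend, standby: Backend):
        self.master = master
        self.standby = standby

    def execute(self, stmt: str) -> str:
        reply = self.master.execute(stmt)
        self.standby.execute(stmt)   # same stream; reply discarded
        return reply
```

The cost of the scheme is visible even in the toy: the standby's replies are thrown away, which is why a query cancel aimed at the standby has no request key to match.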
-
-Andreas Kardos
-
------Original Message-----
-From: Todd M. Shrider
-To: 
-Sent: Thursday, 18 May 2000 17:48
-Subject: [HACKERS] failing over with postgresql
-
-
->
-> is anyone working on, or does anyone have working, a fail-over
-> implementation for the postgresql stuff. i'd be interested in seeing if
-> and how any might be dealing with just general issues as well as the
-> database syncing issues.
->
-> we are looking to do this with heartbeat and lvs in mind. also if anyone
-> is load balancing their databases that would be cool to talk about too.
->
-> ---
-> Todd M. Shrider VA Linux Systems
-> Systems Engineer
-> tshrider@valinux.com www.valinux.com
->
-
-
-From pgsql-hackers-owner+M3662@postgresql.org Tue Jan 23 16:23:34 2001
-Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA04456
-	for ; Tue, 23 Jan 2001 16:23:34 -0500 (EST)
-Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28])
-	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NLKf004705;
-	Tue, 23 Jan 2001 16:20:41 -0500 (EST)
-	(envelope-from pgsql-hackers-owner+M3662@postgresql.org)
-Received: from sectorbase2.sectorbase.com ([208.48.122.131])
-	by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NLAe003753
-	for ; Tue, 23 Jan 2001 16:10:40 -0500 (EST)
-	(envelope-from vmikheev@SECTORBASE.COM)
-Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
-	id ; Tue, 23 Jan 2001 12:49:07 -0800
-Message-ID: <8F4C99C66D04D4118F580090272A7A234D32AF@sectorbase1.sectorbase.com>
-From: "Mikheev, Vadim"
-To: "'dom@idealx.com'" , pgsql-hackers@postgresql.org
-Subject: RE: [HACKERS] Re: AW: Re: MySQL and BerkleyDB (fwd)
-Date: Tue, 23 Jan 2001 13:10:34 -0800
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2653.19)
-Content-Type: text/plain;
-	charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: ORr
-
-> I had thought that the pre-commit 
information could be stored in an
-> auxiliary table by the middleware program ; we would then have
-> to re-implement some sort of higher-level WAL (I thought of the list
-> of the commands performed in the current transaction, with a sequence
-> number for each of them that would guarantee correct ordering between
-> concurrent transactions in case of a REDO). But I fear I am missing
-
-This wouldn't work for the READ COMMITTED isolation level.
-But why do you want to log commands into WAL when each modification
-is already logged in, hm, correct order?
-Well, it makes sense if you're looking for async replication, but
-you don't need two-phase commit for that, and you should be aware of
-the problems with the READ COMMITTED isolation level.
-
-Back to two-phase commit - it's the easiest part of the work required
-for distributed transaction processing.
-Currently we place a single commit record in the log, and the
-transaction is committed when this record (and so all other transaction
-records) is on disk.
-Two-phase commit:
-
-1. For the 1st phase we'll place a "prepared-to-commit" record into the
- log, and this phase is accomplished once the record is flushed to disk.
- At this point the transaction may be committed at any time, because
- all its modifications are logged. But it may still be rolled back
- if this phase failed on other sites of the distributed system.
-
-2. When all sites are prepared to commit, we'll place a "committed"
- record into the log. There is no need to flush it, because in the
- event of a crash the recoverer will have to contact the other sites
- to learn the status of all "prepared" transactions anyway.
-
-That's all! It is really hard to implement distributed lock and
-communication managers, but there is no problem with logging two
-records instead of one. Period.
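[Editor's note: the two log records can be sketched as follows. This is an illustrative model, not PostgreSQL code: `WAL`, `prepare`, and `commit` are invented names, and an in-memory list stands in for the real log.]

```python
# Phase one writes and flushes a "prepared" record; phase two appends a
# "committed" record that need not be flushed, because crash recovery must
# ask the other sites about prepared transactions anyway.
class WAL:
    def __init__(self):
        self.records = []   # records appended, possibly not yet on disk
        self.flushed = 0    # count of records known to be on disk

    def append(self, rec):
        self.records.append(rec)

    def flush(self):
        self.flushed = len(self.records)

def prepare(wal, xid):
    """Phase 1: the transaction is durable once this record reaches disk."""
    wal.append(("prepared", xid))
    wal.flush()             # must hit disk before we report "prepared"

def commit(wal, xid):
    """Phase 2: no flush needed; recovery resolves prepared xids anyway."""
    wal.append(("committed", xid))

wal = WAL()
prepare(wal, 42)
commit(wal, 42)
# The prepared record is guaranteed on disk; the commit record may not be.
```

The only cost over a plain commit is the extra prepared record and its flush, which is the point of the paragraph above.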
- -Vadim - -From pgsql-hackers-owner+M3665@postgresql.org Tue Jan 23 17:05:26 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA05972 - for ; Tue, 23 Jan 2001 17:05:24 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f0NM31008120; - Tue, 23 Jan 2001 17:03:01 -0500 (EST) - (envelope-from pgsql-hackers-owner+M3665@postgresql.org) -Received: from candle.pha.pa.us (candle.navpoint.com [162.33.245.46]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f0NLsU007188 - for ; Tue, 23 Jan 2001 16:54:30 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id QAA05300; - Tue, 23 Jan 2001 16:53:53 -0500 (EST) -From: Bruce Momjian -Message-Id: <200101232153.QAA05300@candle.pha.pa.us> -Subject: Re: [HACKERS] Re: AW: Re: MySQL and BerkleyDB (fwd) -In-Reply-To: <8F4C99C66D04D4118F580090272A7A234D32AF@sectorbase1.sectorbase.com> - "from Mikheev, Vadim at Jan 23, 2001 01:10:34 pm" -To: "Mikheev, Vadim" -Date: Tue, 23 Jan 2001 16:53:53 -0500 (EST) -CC: "'dom@idealx.com'" , pgsql-hackers@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL77 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -[ Charset ISO-8859-1 unsupported, converting... ] -> > I had thought that the pre-commit information could be stored in an -> > auxiliary table by the middleware program ; we would then have -> > to re-implement some sort of higher-level WAL (I thought of the list -> > of the commands performed in the current transaction, with a sequence -> > number for each of them that would guarantee correct ordering between -> > concurrent transactions in case of a REDO). 
But I fear I am missing
->
-> This wouldn't work for READ COMMITTED isolation level.
-> But why do you want to log commands into WAL where each modification
-> is already logged in, hm, correct order?
-> Well, it has sense if you're looking for async replication but
-> you need not in two-phase commit for this and should aware about
-> problems with READ COMMITTED isolevel.
->
-
-I believe the issue here is that while SERIALIZABLE ISOLATION means all
-queries can be run serially, our default is READ COMMITTED, meaning that
-open transactions see committed transactions, even if the transaction
-committed after our transaction started. (FYI, see my chapter on
-transactions for help, http://www.postgresql.org/docs/awbook.html.)
-
-To do higher-level WAL, you would have to record not only the queries,
-but the other queries that were committed at the start of each command
-in your transaction.
-
-Ideally, you could number every commit by its XID in your log, and then
-when processing the query, pass the "committed" transaction ids that
-were visible at the time each command began.
-
-In other words, you can replay the queries in transaction commit order,
-except that you have to have some transactions committed at specific
-points while other transactions are open, i.e.:
-
-XID   Open XIDs   Query
-500               UPDATE t SET col = 3;
-501   500         BEGIN;
-501   500         UPDATE t SET col = 4;
-501               UPDATE t SET col = 5;
-501               COMMIT;
-
-This is a silly example, but it shows that 500 must commit after the
-first command in transaction 501, but before the second command in the
-transaction. This is because UPDATE t SET col = 5 actually sees the
-changes made by transaction 500 in READ COMMITTED isolation level.
-
-I am not advocating this. I think WAL is a better choice. I just
-wanted to outline how replaying the queries in commit order is
-insufficient.
-
-> Back to two-phase commit - it's easiest part of work required for
-> distributed transaction processing.
-> Currently we place single commit record to log and transaction is -> committed when this record (and so all other transaction records) -> is on disk. -> Two-phase commit: -> -> 1. For 1st phase we'll place into log "prepared-to-commit" record -> and this phase will be accomplished after record is flushed on disk. -> At this point transaction may be committed at any time because of -> all its modifications are logged. But it still may be rolled back -> if this phase failed on other sites of distributed system. -> -> 2. When all sites are prepared to commit we'll place "committed" -> record into log. No need to flush it because of in the event of -> crash for all "prepared" transactions recoverer will have to -> communicate other sites to know their statuses anyway. -> -> That's all! It is really hard to implement distributed lock- and -> communication- managers but there is no problem with logging two -> records instead of one. Period. - -Great. - - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - -From pgsql-general-owner+M805@postgresql.org Tue Nov 21 23:53:04 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA19262 - for ; Wed, 22 Nov 2000 00:53:03 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eAM5qYs47249; - Wed, 22 Nov 2000 00:52:34 -0500 (EST) - (envelope-from pgsql-general-owner+M805@postgresql.org) -Received: from racerx.cabrion.com (racerx.cabrion.com [166.82.231.4]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eAM5lJs46653 - for ; Wed, 22 Nov 2000 00:47:19 -0500 (EST) - (envelope-from rob@cabrion.com) -Received: from cabrionhome (gso163-25-211.triad.rr.com [24.163.25.211]) - by racerx.cabrion.com (8.8.7/8.8.7) with SMTP id AAA13731 - for ; Wed, 22 Nov 2000 00:45:20 -0500 -Message-ID: <006501c05447$fb9aa0c0$4100fd0a@cabrion.org> -From: "rob" -To: -Subject: [GENERAL] Synchronization Toolkit -Date: Wed, 22 Nov 2000 00:49:29 -0500 -MIME-Version: 1.0 -Content-Type: multipart/mixed; - boundary="----=_NextPart_000_0062_01C0541E.125CAF30" -X-Priority: 3 -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook Express 5.50.4133.2400 -X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4133.2400 -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -This is a multi-part message in MIME format. - -------=_NextPart_000_0062_01C0541E.125CAF30 -Content-Type: text/plain; charset="iso-8859-1" -Content-Transfer-Encoding: 7bit - -Not to be confused with replication, my concept of synchronization is to -manage changes between a server table (or tables) and one or more mobile, -disconnected databases (i.e. PalmPilot, laptop, etc.). - -I read through the notes in the TODO for this topic and devised a tool kit -for doing synchronization. 
I hope that the PostgreSQL development community
-will find this useful and will help me refine this concept by offering
-insight, experience and some good old-fashioned hacking if you are so
-inclined.
-
-The bottom of this message describes how to use the attached files.
-
-I look forward to your feedback.
-
---rob
-
-
-Methodology:
-
-I devised a concept that I call "session versioning". This means that every
-time a row changes it does NOT get a new version. Rather, it gets stamped
-with the current session version common to all published tables. Clients,
-when they connect for synchronization, will immediately increment this
-common version number, reserve the result as a "post version", and then
-increment the session version again. This version number, implemented as a
-sequence, is common to all synchronized tables and rows.
-
-Any time the server makes changes, the row gets stamped with the current
-session version; when the client posts its changes it uses the reserved
-"post version". The client then makes all its changes, stamping the changed
-rows with its reserved "post version" rather than the current version. The
-reason why is explained later. It is important that the client post all its
-own changes first so that it does not end up receiving records which changed
-since its last session but that it is about to update anyway.
-
-Reserving the post version is a two-step process. First, the number is
-simply stored in a variable for later use. Second, the value is added to a
-lock table (last_stable) to indicate to any concurrent sessions that rows
-with higher version numbers are to be considered "unstable" at the moment
-and they should not attempt to retrieve them at this time. Each client,
-upon connection, will use the lowest value in this lock table (max_version)
-to determine the upper boundary for versions it should retrieve. The lower
-boundary is simply the previous session's "max_version" plus one.
Thus
-when the client retrieves changes, it uses the following SQL "where"
-expression:
-
-WHERE row_version >= max_version and row_version <= last_stable_version and
-version <> this_post_version
-
-The point of reserving and locking a post version is important in that it
-allows concurrent synchronization by multiple clients. The first of many
-clients to connect basically dictates to all future clients that they must
-not take any rows equal to or greater than the one which it just reserved
-and locked. The reason the session version is incremented a second time is
-so that the server may continue to post changes concurrent with any client
-changes and be certain that these concurrent server changes will not taint
-rows the client is about to retrieve. Once the client is finished with its
-session, it removes the lock on its post version.
-
-Partitioning data for use by each node is the next challenge we face. How
-can we control which "slice" of data each client receives? A slice can be
-horizontal or vertical within a table. Horizontal slices are easy: it's
-just the where clause of an SQL statement that says "give me the rows that
-match X criteria". We handle this by storing and appending a where clause
-to each client's retrieval statement in addition to the where clause
-described above. Actually, two where clauses are stored and appended. One
-is per client and one is per publication (table).
-
-We defined horizontal slices by filtering rows. Vertical slices are limited
-by column. The tool kit does provide a mechanism for pseudo vertical
-partitioning. When a client is "subscribed" to a publication, the toolkit
-stores what columns that node is to receive during a session. These are
-stored in the subscribed_cols table. While this does limit the number of
-columns transmitted, the insert/update/delete triggers do not recognize
-changes based on columns.
The "pseudo" nature of our vertical partitioning
-is evident by example:
-
-Say you have a table with name, address and phone number as columns. You
-restrict a client to see only name and address. This means that phone
-number information will not be sent to the client during synchronization,
-and the client can't attempt to alter the phone number of a given entry.
-Great, but . . . if, on the server, the phone number (but not the name or
-address) is changed, the entire row gets marked with a new version. This
-means that the name and address will get sent to the client even though they
-didn't change.
-
-Well, there's the flaw in vertical partitioning. Other than wasting
-bandwidth, the extra row does no harm to the process. The workaround for
-this is to highly normalize your schema when possible.
-
-Collisions are the next crux one encounters with synchronization. When two
-clients retrieve the same row and both make (different) changes, which one
-is correct? So far the system operates totally independently of time. This
-is good because it doesn't rely on the server or client to keep accurate
-time. We could just ignore time altogether, but then we force our clients
-to synchronize on a strict schedule in order to avoid (or reduce)
-collisions. If every node synchronized immediately after making changes we
-could just stop here. Unfortunately this isn't reality. Reality dictates
-that of two clients, A and B, each may pick up the same record on Monday.
-A will make changes on Monday, then leave for vacation. B will make
-changes on Wednesday because new information was gathered in A's absence.
-Client B posts those changes Wednesday. Meanwhile, client A returns from
-vacation on Friday and synchronizes his changes. A overwrites B's changes
-even though A made changes before the most recent information was posted
-by B.
-
-It is clear that we need some form of time stamp to cope with the above
-example.
While clocks aren't the most reliable, they are the only common
-version control available to solve this problem. The system is set up to
-accept (but not require) timestamps from clients, and changes on the server
-are time stamped. The system, when presented with a time stamp for a row,
-will compare them to figure out who wins in a tie. The system makes certain
-"sanity" checks with regard to these time stamps. A client may not attempt
-to post a change with a timestamp that is more than one hour in the future
-(according to what the server thinks "now" is) nor one hour before its last
-synchronization date/time. The client row will be immediately placed into
-the collision table if the timestamp is that far out of whack.
-Implementations of the tool kit should take care to ensure that client and
-server agree on what "now" is before attempting to submit changes with
-timestamps.
-
-Time stamps are not required. Should a client be incapable of tracking
-timestamps, the system will assume that any server row which has been
-changed since the client's last session will win a tie. This is quite
-error-prone, so timestamps are encouraged where possible.
-
-Inserts pose an interesting challenge. Since multiple clients cannot share
-a sequence (often used as a primary key) while disconnected, they are
-responsible for their own unique "row_id" when inserting records. Inserts
-accept any arbitrary key, and write back to the client a special kind of
-update that gives the server's row_id. The client is responsible for making
-sure that this update takes place locally.
-
-Deletes are the last portion of the process. When deletes occur, the
-row_id, version, etc. are stored in a "deleted" table. These entries are
-retrieved by the client using the same version filter as described above.
-The table is pruned at the end of each session by deleting all records with
-versions that are less than the lowest 'last_version' stored for each
-client.
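[Editor's note: the version-window bookkeeping described above can be condensed into a short sketch. Python stands in for the toolkit's Perl; `SyncState`, `start_session`, and `changed_rows` are illustrative names, a counter replaces the version sequence, and a set replaces the last_stable lock table. The upper bound stops just below the lowest reserved version, a slight simplification of the toolkit's min(version) query.]

```python
class SyncState:
    def __init__(self):
        self.version_seq = 0
        self.last_stable = set()   # post versions reserved by open sessions

    def nextval(self):
        self.version_seq += 1
        return self.version_seq

def start_session(state):
    """Reserve a post version and compute the safe retrieval ceiling."""
    post_ver = state.nextval()               # this session writes with this
    # Stop reading just below any version still reserved by a concurrent
    # session (our own post_ver is the newest, so it bounds the empty case).
    max_ver = min(state.last_stable, default=post_ver) - 1
    state.last_stable.add(post_ver)          # lock against other sessions
    state.nextval()                          # bump again so later server
                                             # changes land above the window
    return post_ver, max_ver

def changed_rows(rows, last_ver, max_ver, post_ver):
    """Rows the client should pull: its version window, minus its own posts."""
    return [r for r in rows
            if last_ver < r["version"] <= max_ver and r["version"] != post_ver]

def end_session(state, post_ver):
    state.last_stable.discard(post_ver)      # release the lock
```

The double increment in `start_session` is the "incremented a second time" step from the text: server changes made during the session get versions above every open client's window.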
-
-Having wrapped up the synchronization process, I'll move on to describe some
-points about managing clients, publications and the like.
-
-The tool kit is split into two objects: SyncManagement and Synchronization.
-The Synchronization object exposes an API that client implementations use to
-communicate and receive changes. The management functions handle system
-install and uninstall in addition to publication of tables and client
-subscriptions.
-
-Installation and uninstallation are handled by their corresponding functions
-in the API. All system tables are prefixed and suffixed with four
-underscores, in hopes that this avoids conflicts with existing tables.
-Calling the install function more than once will generate an error message.
-Uninstall will remove all related tables, sequences, functions and triggers
-from the system.
-
-The first step, after installing the system, is to publish a table. A table
-can be published more than once under different names. Simply provide a
-unique name as the second argument to the publish function. Since object
-names are restricted to 32 characters in Postgres, each table is given a
-unique id and this id is used to create the trigger and sequence names.
-Since one table can be published multiple times but only needs one set of
-triggers and one sequence for change management, a reference count is kept
-so that we know when to add/drop triggers and functions. By default, all
-columns are published, but the third argument to the publish function
-accepts an array reference of column names that allows you to specify a
-limited set. Information about the table is stored in the "tables" table,
-info about the publication is in the "publications" table, and column names
-are stored in the "subscribed_cols" table.
-
-The next step is to subscribe a client to a table. A client is identified
-by a user name and a node name. The subscribe function takes three
-arguments: user, node & publication.
The subscription process writes an
-entry into the "subscribed" table with default values. Of note, the
-"RefreshOnce" attribute is set to true whenever a table is published. This
-indicates to the system that a full table refresh should be sent the next
-time the client connects, even if the client requests synchronization rather
-than refresh.
-
-The toolkit does not yet provide a way to manage the whereclause stored at
-either the publication or client level. To use or test this feature, you
-will need to set the whereclause attributes manually.
-
-Tables and users can be unpublished and unsubscribed using the corresponding
-functions within the tool kit's management interface. Because Postgres
-lacks an "ALTER TABLE DROP COLUMN" function, the unpublish function only
-removes default values and indexes for those columns.
-
-The API isn't the most robust thing in the world right now. All functions
-return undef on success and an error string otherwise (like DBD). I hope to
-clean up the API considerably over the next month. The code has not been
-field-tested at this time.
- - -The files attached are: - -1) SynKit.pm (A perl module that contains install/uninstall functions and a -simple api for synchronization & management) - -2) sync_install.pl (Sample code to demonstrate the installation, publishing -and subscribe process) - -3) sync_uninstall.pl (Sample code to demonstrate the uninstallation, -unpublishing and unsubscribe process) - - -To use them on Linux (don't know about Win32 but should work fine): - - - set up a test database and make SURE plpgsql is installed - - - install perl 5.05 along with Date::Parse(TimeDate-1.1) , DBI and DBD::Pg -modules [www.cpan.org] - - - copy all three attached files to a test directory - - - cd to your test directory - - - edit all three files and change the three DBI variables to suit your -system (they are clearly marked) - - - % perl sync_install.pl - - - check out the tables, functions & triggers installed - - - % perl sync.pl - - - check out the 'sync_test' table, do some updates/inserts/deletes and run -sync.pl again - NOTE: Sanity checks default to allow no more than 50% of the table -to be changed by the client in a single session. - If you delete all (or most of) the rows you will get errors when -you run sync.pl again! (by design) - - - % perl sync_uninstall.pl (when you are done) - - - check out the sample scripts and the perl module code (commented, but -not documented) - - - - - - -------=_NextPart_000_0062_01C0541E.125CAF30 -Content-Type: application/octet-stream; name="sync.pl" -Content-Transfer-Encoding: quoted-printable -Content-Disposition: attachment; filename="sync.pl" - - - -# This script depicts the syncronization process for two users. 
- - -## CHANGE THESE THREE VARIABLE TO MATCH YOUR SYSTEM ########### -my $dbi_connect_string =3D 'dbi:Pg:dbname=3Dtest;host=3Dsnoopy'; # -my $db_user =3D 'test'; # -my $db_pass =3D 'test'; # -################################################################# - -my $ret; #holds return value - -use SynKit; - -#create a synchronization object (pass dbi connection info) -my $s =3D Synchronize->new($dbi_connect_string,$db_user,$db_pass); - -#start a session by passing a user name, "node" identifier and a collision = -queue name (client or server) -$ret =3D $s->start_session('JOE','REMOTE_NODE_NAME','server'); -print "Handle this error: $ret\n\n" if $ret; - -#call this once before attempting to apply individual changes -$ret =3D $s->start_changes('sync_test',['name']); -print "Handle this error: $ret\n\n" if $ret; - -#call this for each change the client wants to make to the database -$ret =3D $s->apply_change(CLIENTROWID,'insert',undef,['ted']); -print "Handle this error: $ret\n\n" if $ret; - -#call this for each change the client wants to make to the database -$ret =3D $s->apply_change(CLIENTROWID,'insert','1973-11-10 11:25:00 AM -05= -',['tim']); -print "Handle this error: $ret\n\n" if $ret; - -#call this for each change the client wants to make to the database -$ret =3D $s->apply_change(999,'update',undef,['tom']); -print "Handle this error: $ret\n\n" if $ret; - -#call this for each change the client wants to make to the database -$ret =3D $s->apply_change(1,'update',undef,['tom']); -print "Handle this error: $ret\n\n" if $ret; - -#call this once after all changes have been submitted -$ret =3D $s->end_changes(); -print "Handle this error: $ret\n\n" if $ret; - -#call this to get updates from all subscribed tables -$ret =3D $s->get_all_updates(); -print "Handle this error: $ret\n\n" if $ret; - -print "\n\nSyncronization session is complete. (JOE) \n\n"; - - -# make some changes to the database (server perspective) - -print "\n\nMaking changes to the the database. 
(server side) \n\n"; - -use DBI; -my $dbh =3D DBI->connect($dbi_connect_string,$db_user,$db_pass); - -$dbh->do("insert into sync_test values ('roger')"); -$dbh->do("insert into sync_test values ('john')"); -$dbh->do("insert into sync_test values ('harry')"); -$dbh->do("delete from sync_test where name =3D 'roger'"); -$dbh->do("update sync_test set name =3D 'tom' where name =3D 'harry'"); - -$dbh->disconnect; - - -#now do another session for a different user - -#start a session by passing a user name, "node" identifier and a collision = -queue name (client or server) -$ret =3D $s->start_session('KEN','ANOTHER_REMOTE_NODE_NAME','server'); -print "Handle this error: $ret\n\n" if $ret; - -#call this to get updates from all subscribed tables -$ret =3D $s->get_all_updates(); -print "Handle this error: $ret\n\n" if $ret; - -print "\n\nSynchronization session is complete. (KEN)\n\n"; - -print "Now look at your database and see what happend, make changes to the = -test table, etc. and run this again.\n\n"; - -------=_NextPart_000_0062_01C0541E.125CAF30 -Content-Type: application/octet-stream; name="sync_uninstall.pl" -Content-Transfer-Encoding: quoted-printable -Content-Disposition: attachment; filename="sync_uninstall.pl" - - -# this script uninstalls the synchronization system using the SyncManager o= -bject; - -use SynKit; - -### CHANGE THESE TO MATCH YOUR SYSTEM ######################## -my $dbi_connect_string =3D 'dbi:Pg:dbname=3Dtest;host=3Dsnoopy'; # -my $db_user =3D 'test'; # -my $db_pass =3D 'test'; # -################################################################# - - -my $ret; #holds return value - -#create an instance of the SyncManager object -my $m =3D SyncManager->new($dbi_connect_string,$db_user,$db_pass); - -# call this to unsubscribe a user/node (not necessary if you are uninstalli= -ng) -print $m->unsubscribe('KEN','ANOTHER_REMOTE_NODE_NAME','sync_test'); - -#call this to unpublish a table (not necessary if you are uninstalling) -print 
$m->unpublish('sync_test'); - -#call this to uninstall the syncronization system -# NOTE: this will automatically unpublish & unsubscribe all users -print $m->UNINSTALL; - -# now let's drop our little test table -use DBI; -my $dbh =3D DBI->connect($dbi_connect_string,$db_user,$db_pass); -$dbh->do("drop table sync_test"); -$dbh->disconnect; - -print "\n\nI hope you enjoyed this little demonstration\n\n"; - - - -------=_NextPart_000_0062_01C0541E.125CAF30 -Content-Type: application/octet-stream; name="sync_install.pl" -Content-Transfer-Encoding: quoted-printable -Content-Disposition: attachment; filename="sync_install.pl" - - -# This script shows how to install the synchronization system=20 -# using the SyncManager object - -use SynKit; - -### CHANGE THESE TO MATCH YOUR SYSTEM ########################## -my $dbi_connect_string =3D 'dbi:Pg:dbname=3Dtest;host=3Dsnoopy'; # -my $db_user =3D 'test'; # -my $db_pass =3D 'test'; # -################################################################# -my $ret; #holds return value - - -#create an instance of the sync manager object -my $m =3D SyncManager->new($dbi_connect_string,$db_user,$db_pass); - -#Call this to install the syncronization management tables, etc. 
-$ret =3D $m->INSTALL; -die "Handle this error: $ret\n\n" if $ret; - - - -#create a test table for us to demonstrate with -use DBI; -my $dbh =3D DBI->connect($dbi_connect_string,$db_user,$db_pass); -$dbh->do("create table sync_test (name text)"); -$dbh->do("insert into sync_test values ('rob')"); -$dbh->do("insert into sync_test values ('rob')"); -$dbh->do("insert into sync_test values ('rob')"); -$dbh->do("insert into sync_test values ('ted')"); -$dbh->do("insert into sync_test values ('ted')"); -$dbh->do("insert into sync_test values ('ted')"); -$dbh->disconnect; - - - - -#call this to "publish" a table -$ret =3D $m->publish('sync_test'); -print "Handle this error: $ret\n\n" if $ret; - -#call this to "subscribe" a user/node to a publication (table) -$ret =3D $m->subscribe('JOE','REMOTE_NODE_NAME','sync_test'); -print "Handle this error: $ret\n\n" if $ret; - -#call this to "subscribe" a user/node to a publication (table) -$ret =3D $m->subscribe('KEN','ANOTHER_REMOTE_NODE_NAME','sync_test'); -print "Handle this error: $ret\n\n" if $ret; - - -print "Now you can do: 'perl sync.pl' a few times to play\n\n"; -print "Do 'perl sync_uninstall.pl' to uninstall the system\n"; - - -------=_NextPart_000_0062_01C0541E.125CAF30 -Content-Type: application/octet-stream; name="SynKit.pm" -Content-Transfer-Encoding: quoted-printable -Content-Disposition: attachment; filename="SynKit.pm" - -# Perl DB synchronization toolkit - -#created for postgres 7.0.2 + -use strict; - -BEGIN { - use vars qw($VERSION); - # set the version for version checking - $VERSION =3D 1.00; -} - - -package Synchronize; - -use DBI; - -use Date::Parse; - -# new requires 3 arguments: dbi connection string, plus the corresponding u= -sername and password to get connected to the database -sub new { - my $proto =3D shift; - my $class =3D ref($proto) || $proto; - my $self =3D {}; - - my $dbi =3D shift; - my $user =3D shift; - my $pass =3D shift; - - $self->{DBH} =3D DBI->connect($dbi,$user,$pass) || die "Failed to 
connect = -to database: ".DBI->errstr(); - - $self->{user} =3D undef; - $self->{node} =3D undef; - $self->{status} =3D undef; # holds status of table update portion of sessi= -on - $self->{pubs} =3D {}; #holds hash of pubs available to sessiom with val = -=3D 1 if ok to request sync - $self->{orderpubs} =3D undef; #holds array ref of subscribed pubs ordered = -by sync_order - $self->{this_post_ver} =3D undef; #holds the version number under which th= -is session will post changes - $self->{max_ver} =3D undef; #holds the maximum safe version for getting up= -dates - $self->{current} =3D {}; #holds the current publication info to which chan= -ges are being applied - $self->{queue} =3D 'server'; # tells collide function what to do with coll= -isions. (default is to hold on server) - - $self->{DBLOG}=3D DBI->connect($dbi,$user,$pass) || die "cannot log to DB:= - ".DBI->errstr();=20 - - - return bless ($self, $class); -} - -sub dblog {=20 - my $self =3D shift; - my $msg =3D $self->{DBLOG}->quote($_[0]); - my $quser =3D $self->{DBH}->quote($self->{user}); - my $qnode =3D $self->{DBH}->quote($self->{node}); - $self->{DBLOG}->do("insert into ____sync_log____ (username, nodename,stamp= -, message) values($quser, $qnode, now(), $msg)"); -} - - -#start_session establishes session wide information and other housekeeping = -chores - # Accepts username, nodename and queue (client or server) as arguments; - -sub start_session { - my $self =3D shift; - $self->{user} =3D shift || die 'Username is required'; - $self->{node} =3D shift || die 'Nodename is required'; - $self->{queue} =3D shift; - - - if ($self->{queue} ne 'server' && $self->{queue} ne 'client') { - die "You must provide a queue argument of either 'server' or 'client'"; - } - - my $quser =3D $self->{DBH}->quote($self->{user}); - my $qnode =3D $self->{DBH}->quote($self->{node}); - - my $sql =3D "select pubname from ____subscribed____ where username =3D $qu= -ser and nodename =3D $qnode"; - my @pubs =3D 
$self->GetColList($sql); - - return 'User/Node has no subscriptions!' if !defined(@pubs); - - # go though the list and check permissions and rules for each - foreach my $pub (@pubs) { - my $qpub =3D $self->{DBH}->quote($pub); - my $sql =3D "select disabled, pubname, fullrefreshonly, refreshonce,post_= -ver from ____subscribed____ where username =3D $quser and pubname =3D $qpub= - and nodename =3D $qnode"; - my $sth =3D $self->{DBH}->prepare($sql) || die $self->{DBH}->errstr; - $sth->execute || die $self->{DBH}->errstr; - my @row; - while (@row =3D $sth->fetchrow_array) { - next if $row[0]; #publication is disabled - next if !defined($row[1]); #publication does not exist (should never occ= -ur) - if ($row[2] || $row[3]) { #refresh of refresh once flag is set - $self->{pubs}->{$pub} =3D 0; #refresh only - next; - } - if (!defined($row[4])) { #no previous session exists, must refresh - $self->{pubs}->{$pub} =3D 0; #refresh only - next; - } - $self->{pubs}->{$pub} =3D 1; #OK for sync - } - $sth->finish; - } - - - $sql =3D "select pubname from ____publications____ order by sync_order"; - my @op =3D $self->GetColList($sql); - my @orderpubs; - - #loop through ordered pubs and remove non subscribed publications - foreach my $pub (@op) { - push @orderpubs, $pub if defined($self->{pubs}->{$pub}); - } -=09 - $self->{orderpubs} =3D \@orderpubs; - -# Now we obtain a session version number, etc. - - $self->{DBH}->{AutoCommit} =3D 0; #allows "transactions" - $self->{DBH}->{RaiseError} =3D 1; #script [or eval] will automatically die= - on errors - - eval { #start DB transaction - - #lock the version sequence until we determin that we have gotten - #a good value. Lock will be released on commit. 
-  $self->{DBH}->do('lock ____version_seq____ in access exclusive mode');
-
-  # remove stale locks if they exist
-  my $sql = "delete from ____last_stable____ where username = $quser and nodename = $qnode";
-  $self->{DBH}->do($sql);
-
-  # increment version sequence & grab the next val as post_ver
-  $sql = "select nextval('____version_seq____')";
-  my $sth = $self->{DBH}->prepare($sql);
-  $sth->execute;
-  ($self->{this_post_ver}) = $sth->fetchrow_array();
-  $sth->finish;
-
-  # grab max_ver from last_stable
-  $sql = "select min(version) from ____last_stable____";
-  $sth = $self->{DBH}->prepare($sql);
-  $sth->execute;
-  ($self->{max_ver}) = $sth->fetchrow_array();
-  $sth->finish;
-
-  # if there was no version in the lock table, then take the ID that was in use
-  # when we started the session ($max_ver - 1)
-  $self->{max_ver} = $self->{this_post_ver} - 1 if (!defined($self->{max_ver}));
-
-  # lock post_ver by placing it in last_stable
-  $self->{DBH}->do("insert into ____last_stable____ (version, username, nodename) values ($self->{this_post_ver}, $quser, $qnode)");
-
-  # increment version sequence again (discard result)
-  $sql = "select nextval('____version_seq____')";
-  $sth = $self->{DBH}->prepare($sql);
-  $sth->execute;
-  $sth->fetchrow_array();
-  $sth->finish;
-
- }; # end eval/transaction
-
- if ($@) { # part of transaction failed
-  $self->{DBH}->rollback;
-  return 'Start session failed';
- } else { # all's well, commit block
-  $self->{DBH}->commit;
- }
- $self->{DBH}->{AutoCommit} = 1;
- $self->{DBH}->{RaiseError} = 0;
-
- return undef;
-
-}
-
-#start_changes should be called once before applying individual change requests
- # Requires publication and ref to columns that will be updated as arguments
-sub start_changes {
- my $self = shift;
- my $pub = shift || die 'Publication is required';
- my $colref = shift || die 'Reference to column array is required';
-
- $self->{status} = 'starting';
-
- my $qpub =
$self->{DBH}->quote($pub);
- my $quser = $self->{DBH}->quote($self->{user});
- my $qnode = $self->{DBH}->quote($self->{node});
-
- my @cols = @{$colref};
- my @subcols = $self->GetColList("select col_name from ____subscribed_cols____ where username = $quser and nodename = $qnode and pubname = $qpub");
- my %subcols;
- foreach my $col (@subcols) {
-  $subcols{$col} = 1;
- }
- foreach my $col (@cols) {
-  return "User/node is not subscribed to column '$col'" if !$subcols{$col};
- }
-
- my $sql = "select pubname, readonly, last_session, post_ver, last_ver, whereclause, sanity_limit, sanity_delete, sanity_update, sanity_insert from ____subscribed____ where username = $quser and pubname = $qpub and nodename = $qnode";
- my ($junk, $readonly, $last_session, $post_ver, $last_ver, $whereclause, $sanity_limit, $sanity_delete, $sanity_update, $sanity_insert) = $self->GetOneRow($sql);
-
- return 'Publication is read only' if $readonly;
-
- $sql = "select whereclause from ____publications____ where pubname = $qpub";
- my ($wc) = $self->GetOneRow($sql);
- $whereclause = '('.$whereclause.')' if $whereclause;
- $whereclause = $whereclause.' and ('.$wc.')' if $wc;
-
- my ($table) = $self->GetOneRow("select tablename from ____publications____ where pubname = $qpub");
-
- return 'Publication is not registered correctly' if !defined($table);
-
- my %info;
- $info{pub} = $pub;
- $info{whereclause} = $whereclause;
- $info{post_ver} = $post_ver;
- $last_session =~ s/([+-]\d\d?)$/ $1/; # put a space before timezone
- $last_session = str2time($last_session); # convert to perl time (seconds since 1970)
- $info{last_session} = $last_session;
- $info{last_ver} = $last_ver;
- $info{table} = $table;
- $info{cols} = \@cols;
-
- $sql = "select count(oid) from $table";
- $sql = $sql .'
'.$whereclause if $whereclause;
- my ($rowcount) = $self->GetOneRow($sql);
-
- # calculate sanity levels (convert from % to number of rows)
- # limits defined as less than 1 mean no limit
- $info{sanitylimit} = $rowcount * ($sanity_limit / 100) if $sanity_limit > 0;
- $info{insertlimit} = $rowcount * ($sanity_insert / 100) if $sanity_insert > 0;
- $info{updatelimit} = $rowcount * ($sanity_update / 100) if $sanity_update > 0;
- $info{deletelimit} = $rowcount * ($sanity_delete / 100) if $sanity_delete > 0;
-
- $self->{sanitycount} = 0;
- $self->{updatecount} = 0;
- $self->{insertcount} = 0;
- $self->{deletecount} = 0;
-
- $self->{current} = \%info;
-
- $self->{DBH}->{AutoCommit} = 0; # turn on transaction behavior so we can roll back on sanity limits, etc.
-
- $self->{status} = 'ready';
-
- return undef;
-}
-
-#call this once all changes are submitted to commit them
-sub end_changes {
- my $self = shift;
- return undef if $self->{status} ne 'ready';
- $self->{DBH}->commit;
- $self->{DBH}->{AutoCommit} = 1;
- $self->{status} = 'success';
- return undef;
-}
-
-#call apply_change once for each row-level client update
- # Accepts 4 params: rowid, action, timestamp and reference to data array
- # Note: timestamp can be undef, data can be undef
- # timestamp MUST be in perl time (secs since 1970)
-
-#this routine checks basic timestamp info and sanity limits, then passes the info along to do_action() for processing
-sub apply_change {
- my $self = shift;
- my $rowid = shift || return 'Row ID is required'; # don't die just for one bad row
- my $action = shift || return 'Action is required'; # don't die just for one bad row
- my $timestamp = shift;
- my $dataref = shift;
- $action = lc($action);
-
- $timestamp = str2time($timestamp) if $timestamp;
-
- return 'Status failure, cannot accept changes: '.$self->{status} if $self->{status} ne 'ready';
-
- my %info = %{$self->{current}};
-
$self->{sanitycount}++;
- if ($info{sanitylimit} && $self->{sanitycount} > $info{sanitylimit}) {
-  # too many changes from client
-  my $ret = $self->sanity('limit');
-  return $ret if $ret;
- }
-
- if ($timestamp && $timestamp > time() + 3600) { # current time + one hour
-  # client's clock is way off, cannot submit changes in the future
-  my $ret = $self->collide('future', $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-  return $ret if $ret;
- }
-
- if ($timestamp && $timestamp < $info{last_session} - 3600) { # last session time less one hour
-  # client's clock is way off, cannot submit changes that occurred before the last sync date
-  my $ret = $self->collide('past', $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-  return $ret if $ret;
- }
-
- my ($crow, $cver, $ctime); # current row, ver, time
- if ($action ne 'insert') {
-  my $sql = "select ____rowid____, ____rowver____, ____stamp____ from $info{table} where ____rowid____ = $rowid";
-  ($crow, $cver, $ctime) = $self->GetOneRow($sql);
-  if (!defined($crow)) {
-   my $ret = $self->collide('norow', $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-   return $ret if $ret;
-  }
-
-  $ctime =~ s/([+-]\d\d?)$/ $1/; # put space before timezone
-  $ctime = str2time($ctime) if $ctime; # convert to perl time
-
-  if ($timestamp) {
-   if ($ctime < $timestamp) {
-    my $ret = $self->collide('time', $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-    return $ret if $ret;
-   }
-  } else {
-   if ($cver > $self->{this_post_ver}) {
-    my $ret = $self->collide('version', $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-    return $ret if $ret;
-   }
-  }
- }
-
- if ($action eq 'insert') {
-  $self->{insertcount}++;
-  if ($info{insertlimit} && $self->{insertcount} > $info{insertlimit}) {
-   # too many changes from client
-   my $ret =
$self->sanity('insert');
-   return $ret if $ret;
-  }
-
-  my $qtable = $self->{DBH}->quote($info{table});
-  my ($table_id) = $self->GetOneRow("select table_id from ____tables____ where tablename = $qtable");
-  return 'Table incorrectly registered, cannot get rowid sequence name: '.$self->{DBH}->errstr() if not defined $table_id;
-  my $rowidsequence = '_'.$table_id.'__rowid_seq';
-
-  my @data;
-  foreach my $val (@{$dataref}) {
-   push @data, $self->{DBH}->quote($val);
-  }
-  my $sql = "insert into $info{table} (";
-  if ($timestamp) {
-   $sql = $sql . join(',',@{$info{cols}}) . ',____rowver____, ____stamp____) values (';
-   $sql = $sql . join(',',@data) .','.$self->{this_post_ver}.',\''.localtime($timestamp).'\')';
-  } else {
-   $sql = $sql . join(',',@{$info{cols}}) . ',____rowver____) values (';
-   $sql = $sql . join(',',@data) .','.$self->{this_post_ver}.')';
-  }
-  my $ret = $self->{DBH}->do($sql);
-  if (!$ret) {
-   my $ret = $self->collide($self->{DBH}->errstr(), $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-   return $ret if $ret;
-  }
-  my ($newrowid) = $self->GetOneRow("select currval('$rowidsequence')");
-  return 'Failed to get current rowid on inserted row: '.$self->{DBH}->errstr if not defined $newrowid;
-  $self->changerowid($rowid, $newrowid);
- }
-
- if ($action eq 'update') {
-  $self->{updatecount}++;
-  if ($info{updatelimit} && $self->{updatecount} > $info{updatelimit}) {
-   # too many changes from client
-   my $ret = $self->sanity('update');
-   return $ret if $ret;
-  }
-  my @data;
-  foreach my $val (@{$dataref}) {
-   push @data, $self->{DBH}->quote($val);
-  }
-
-  my $sql = "update $info{table} set ";
-  my @cols = @{$info{cols}};
-  foreach my $col (@cols) {
-   my $val = shift @data;
-   $sql = $sql . "$col = $val,";
-  }
-  $sql = $sql." ____rowver____ = $self->{this_post_ver}";
-  $sql = $sql.", ____stamp____ = '".localtime($timestamp)."'" if $timestamp;
-  $sql = $sql."
where ____rowid____ = $rowid";
-  $sql = $sql." and $info{whereclause}" if $info{whereclause};
-  my $ret = $self->{DBH}->do($sql);
-  if (!$ret) {
-   my $ret = $self->collide($self->{DBH}->errstr(), $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-   return $ret if $ret;
-  }
- }
-
- if ($action eq 'delete') {
-  $self->{deletecount}++;
-  if ($info{deletelimit} && $self->{deletecount} > $info{deletelimit}) {
-   # too many changes from client
-   my $ret = $self->sanity('delete');
-   return $ret if $ret;
-  }
-  if ($timestamp) {
-   my $sql = "update $info{table} set ____rowver____ = $self->{this_post_ver}, ____stamp____ = '".localtime($timestamp)."' where ____rowid____ = $rowid";
-   $sql = $sql . " and $info{whereclause}" if $info{whereclause};
-   $self->{DBH}->do($sql) || return 'Predelete update failed: '.$self->{DBH}->errstr;
-  } else {
-   my $sql = "update $info{table} set ____rowver____ = $self->{this_post_ver} where ____rowid____ = $rowid";
-   $sql = $sql . " and $info{whereclause}" if $info{whereclause};
-   $self->{DBH}->do($sql) || return 'Predelete update failed: '.$self->{DBH}->errstr;
-  }
-  my $sql = "delete from $info{table} where ____rowid____ = $rowid";
-  $sql = $sql . " and $info{whereclause}" if $info{whereclause};
-  my $ret = $self->{DBH}->do($sql);
-  if (!$ret) {
-   my $ret = $self->collide($self->{DBH}->errstr(), $info{table}, $rowid, $action, undef, $timestamp, $dataref, $self->{queue});
-   return $ret if $ret;
-  }
- }
-
- return undef;
-}
-
-sub changerowid {
- my $self = shift;
- my $oldid = shift;
- my $newid = shift;
- $self->writeclient('changeid',"$oldid\t$newid");
-}
-
-#writes info to client
-sub writeclient {
- my $self = shift;
- my $type = shift;
- my @info = @_;
- print "$type: ",join("\t",@info),"\n";
- return undef;
-}
-
-# Override this for custom behavior.
Default is to echo back the sanity failure reason.
-# If you want to override a sanity failure, you can do so by returning undef.
-sub sanity {
- my $self = shift;
- my $reason = shift;
- $self->{status} = 'sanity exceeded';
- $self->{DBH}->rollback;
- return $reason;
-}
-
-# Override this for custom behavior.  Default is to echo back the failure reason.
-# If you want to override a collision, you can do so by returning undef.
-sub collide {
- my $self = shift;
- my ($reason,$table,$rowid,$action,$rowver,$timestamp,$data,$queue) = @_;
-
- my @data;
- foreach my $val (@{$data}) {
-  push @data, $self->{DBH}->quote($val);
- }
-
- if ($reason =~ /integrity/i || $reason =~ /constraint/i) {
-  $self->{status} = 'integrity violation';
-  $self->{DBH}->rollback;
- }
-
- my $datastring;
- my @cols = @{$self->{current}->{cols}};
- foreach my $col (@cols) {
-  my $val = shift @data;
-  $datastring = $datastring . "$col = $val,";
- }
- chop $datastring; # remove trailing comma
-
- if ($queue eq 'server') {
-  $timestamp = localtime($timestamp) if defined($timestamp);
-  $rowid = $self->{DBH}->quote($rowid);
-  $rowid = 'null' if !defined($rowid);
-  $rowver = 'null' if !defined($rowver);
-  $timestamp = $self->{DBH}->quote($timestamp);
-  my $qqueue = $self->{DBH}->quote($queue);
-  my $qtable = $self->{DBH}->quote($table);
-  my $qreason = $self->{DBH}->quote($reason);
-  my $qaction = $self->{DBH}->quote($action);
-  my $quser = $self->{DBH}->quote($self->{user});
-  my $qnode = $self->{DBH}->quote($self->{node});
-  $datastring = $self->{DBH}->quote($datastring);
-
-  my $sql = "insert into ____collision____ (rowid,
-tablename, rowver, stamp, data, reason, action, username,
-nodename, queue) values($rowid, $qtable, $rowver, $timestamp, $datastring,
-$qreason, $qaction, $quser, $qnode, $qqueue)";
-  $self->{DBH}->do($sql) || die 'Failed to write to collision table: '.$self->{DBH}->errstr;
-
- } else {
-
$self->writeclient('collision',$rowid,$table, $rowver, $timestamp,$reason, $action,$self->{user}, $self->{node}, $data);
-
- }
- return $reason;
-}
-
-#calls get_updates once for each publication the user/node is subscribed to, in correct sync_order
-sub get_all_updates {
- my $self = shift;
-
- foreach my $pub (@{$self->{orderpubs}}) {
-  $self->get_updates($pub, 1); # request update as sync unless overridden by flags
- }
-}
-
-# Call this once for each table the client needs refreshed or sync'ed AFTER all inbound client changes have been posted
-# Accepts publication and sync flag as arguments
-sub get_updates {
- my $self = shift;
- my $pub = shift || die 'Publication is required';
- my $sync = shift;
-
- my $qpub = $self->{DBH}->quote($pub);
- my $quser = $self->{DBH}->quote($self->{user});
- my $qnode = $self->{DBH}->quote($self->{node});
-
- # enforce refresh and refreshonce flags
- undef $sync if !$self->{pubs}->{$pub};
-
- my @cols = $self->GetColList("select col_name from ____subscribed_cols____ where username = $quser and nodename = $qnode and pubname = $qpub");
-
- my ($table) = $self->GetOneRow("select tablename from ____publications____ where pubname = $qpub");
- return 'Table incorrectly registered for read' if !defined($table);
- my $qtable = $self->{DBH}->quote($table);
-
- my $sql = "select pubname, last_session, post_ver, last_ver, whereclause from ____subscribed____ where username = $quser and pubname = $qpub and nodename = $qnode";
- my ($junk, $last_session, $post_ver, $last_ver, $whereclause) = $self->GetOneRow($sql);
-
- my ($wc) = $self->GetOneRow("select whereclause from ____publications____ where pubname = $qpub");
-
- $whereclause = '('.$whereclause.')' if $whereclause;
-
- $whereclause = $whereclause.'
and ('.$wc.')' if $wc;
-
- if ($sync) {
-  $self->writeclient('start synchronize', $pub);
- } else {
-  $self->writeclient('start refresh', $pub);
-  $self->{DBH}->do("update ____subscribed____ set refreshonce = false where pubname = $qpub and username = $quser and nodename = $qnode") || return 'Failed to clear RefreshOnce flag: '.$self->{DBH}->errstr;
- }
-
- $self->writeclient('columns',@cols);
-
- $sql = "select ____rowid____, ".join(',', @cols)." from $table";
- if ($sync) {
-  $sql = $sql." where (____rowver____ <= $self->{max_ver} and ____rowver____ > $last_ver)";
-  if (defined($self->{this_post_ver})) {
-   $sql = $sql . " and (____rowver____ <> $post_ver)";
-  }
- } else {
-  $sql = $sql." where (____rowver____ <= $self->{max_ver})";
- }
- $sql = $sql." and $whereclause" if $whereclause;
-
- my $sth = $self->{DBH}->prepare($sql) || return 'Failed to prepare SQL for updates: '.$self->{DBH}->errstr;
- $sth->execute || return 'Failed to execute SQL for updates: '.$self->{DBH}->errstr;
- my @row;
- while (@row = $sth->fetchrow_array) {
-  $self->writeclient('update/insert',@row);
- }
-
- $sth->finish;
-
- # now get deleted rows
- if ($sync) {
-  $sql = "select rowid from ____deleted____ where (tablename = $qtable)";
-  $sql = $sql." and (rowver <= $self->{max_ver} and rowver > $last_ver)";
-  if (defined($self->{this_post_ver})) {
-   $sql = $sql . " and (rowver <> $self->{this_post_ver})";
-  }
-  $sql = $sql."
and $whereclause" if $whereclause; - - $sth =3D $self->{DBH}->prepare($sql) || return 'Failed to get prepare SQL= - for deletes: '.$self->{DBH}->errstr; - $sth->execute || return 'Failed to execute SQL for deletes: '.$self->{DBH= -}->errstr; - my @row; - while (@row =3D $sth->fetchrow_array) { - $self->writeclient('delete',@row); - } - - $sth->finish; - } - - if ($sync) { - $self->writeclient('end synchronize', $pub); - } else { - $self->writeclient('end refresh', $pub); - } - - my $qpub =3D $self->{DBH}->quote($pub); - my $quser =3D $self->{DBH}->quote($self->{user}); - my $qnode =3D $self->{DBH}->quote($self->{node}); - - $self->{DBH}->do("update ____subscribed____ set last_ver =3D $self->{max_v= -er}, last_session =3D now(), post_ver =3D $self->{this_post_ver} where user= -name =3D $quser and nodename =3D $qnode and pubname =3D $qpub"); - return undef; -} - - -# Call this once when everything else is done. Does housekeeping.=20 -# (MAKE THIS AN OBJECT DESTRUCTOR?) -sub DESTROY { - my $self =3D shift; - -#release version from lock table (including old ones) - my $quser =3D $self->{DBH}->quote($self->{user}); - my $qnode =3D $self->{DBH}->quote($self->{node}); - my $sql =3D "delete from ____last_stable____ where username =3D $quser and= - nodename =3D $qnode"; - $self->{DBH}->do($sql); - -#clean up deleted table - my ($version) =3D $self->GetOneRow("select min(last_ver) from ____subscrib= -ed____"); - return undef if not defined $version; - $self->{DBH}->do("delete from ____deleted____ where rowver < $version") ||= - return 'Failed to prune deleted table'.$self->{DBH}->errstr;; - - -#disconnect from DBD sessions - $self->{DBH}->disconnect; - $self->{DBLOG}->disconnect; - return undef; -} - -############# Helper Subs ############ -sub GetColList { - my $self =3D shift; - my $sql =3D shift || die 'Must provide sql select statement'; - my $sth =3D $self->{DBH}->prepare($sql) || return undef; - $sth->execute || return undef; - my $val; - my @col; - while (($val) =3D 
$sth->fetchrow_array) {
-  push @col, $val;
- }
- $sth->finish;
- return @col;
-}
-
-sub GetOneRow {
- my $self = shift;
- my $sql = shift || die 'Must provide sql select statement';
- my $sth = $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row = $sth->fetchrow_array;
- $sth->finish;
- return @row;
-}
-
-
-
-package SyncManager;
-
-use DBI;
-# new requires 3 arguments: dbi connection string, plus the corresponding username and password
-
-sub new {
- my $proto = shift;
- my $class = ref($proto) || $proto;
- my $self = {};
-
- my $dbi = shift;
- my $user = shift;
- my $pass = shift;
-
- $self->{DBH} = DBI->connect($dbi,$user,$pass) || die "Failed to connect to database: ".DBI->errstr();
-
- $self->{DBLOG} = DBI->connect($dbi,$user,$pass) || die "cannot log to DB: ".DBI->errstr();
-
- return bless ($self, $class);
-}
-
-sub dblog {
- my $self = shift;
- my $msg = $self->{DBLOG}->quote($_[0]);
- my $quser = $self->{DBH}->quote($self->{user});
- my $qnode = $self->{DBH}->quote($self->{node});
- $self->{DBLOG}->do("insert into ____sync_log____ (username, nodename, stamp, message) values($quser, $qnode, now(), $msg)");
-}
-
-#this should never need to be called, but it might if a node bails without releasing its locks
-sub ReleaseAllLocks {
- my $self = shift;
- $self->{DBH}->do("delete from ____last_stable____");
-}
-
-# Adds a publication to the system.  Also adds triggers, sequences, etc. associated with the table if appropriate.
- # accepts two arguments: the name of a physical table and the name under which to publish it
- # NOTE: the publication name is optional and will default to the table name if not supplied
- # returns undef if ok, else error string
-sub publish {
- my $self = shift;
- my $table = shift || die 'You must provide a table name (and optionally a unique publication name)';
- my $pub = shift;
- $pub = $table if not defined($pub);
-
- my $qpub = $self->{DBH}->quote($pub);
- my $sql = "select tablename from ____publications____ where pubname = $qpub";
- my ($junk) = $self->GetOneRow($sql);
- return 'Publication already exists' if defined($junk);
-
- my $qtable = $self->{DBH}->quote($table);
-
- $sql = "select table_id, refcount from ____tables____ where tablename = $qtable";
- my ($id, $refcount) = $self->GetOneRow($sql);
-
- if (!defined($id)) {
-  $self->{DBH}->do("insert into ____tables____ (tablename, refcount) values ($qtable,1)") || return 'Failed to register table: ' . $self->{DBH}->errstr;
-  my $sql = "select table_id from ____tables____ where tablename = $qtable";
-  ($id) = $self->GetOneRow($sql);
- }
-
- if (defined($refcount)) {
-  $self->{DBH}->do("update ____tables____ set refcount = refcount+1 where table_id = $id") || return 'Failed to update reference count: ' . $self->{DBH}->errstr;
- } else {
-
-  $id = '_'.$id.'_';
-
-  my @cols = $self->GetTableCols($table, 1); # 1 = get hidden cols too
-  my %skip;
-  foreach my $col (@cols) {
-   $skip{$col} = 1;
-  }
-
-  if (!$skip{____rowver____}) {
-   $self->{DBH}->do("alter table $table add column ____rowver____ int4"); # don't fail here in case table is being republished, just accept the error silently
-  }
-  $self->{DBH}->do("update $table set ____rowver____ = ____version_seq____.last_value - 1") || return 'Failed to initialize rowver: ' .
$self->{DBH}->errstr;
-
-  if (!$skip{____rowid____}) {
-   $self->{DBH}->do("alter table $table add column ____rowid____ int4"); # don't fail here in case table is being republished, just accept the error silently
-  }
-
-  my $index = $id.'____rowid____idx';
-  $self->{DBH}->do("create index $index on $table(____rowid____)") || return 'Failed to create rowid index: ' . $self->{DBH}->errstr;
-
-  my $sequence = $id.'_rowid_seq';
-  $self->{DBH}->do("create sequence $sequence") || return 'Failed to create rowid sequence: ' . $self->{DBH}->errstr;
-
-  $self->{DBH}->do("alter table $table alter column ____rowid____ set default nextval('$sequence')"); # don't fail here in case table is being republished, just accept the error silently
-
-  $self->{DBH}->do("update $table set ____rowid____ = nextval('$sequence')") || return 'Failed to initialize rowid: ' . $self->{DBH}->errstr;
-
-  if (!$skip{____stamp____}) {
-   $self->{DBH}->do("alter table $table add column ____stamp____ timestamp"); # don't fail here in case table is being republished, just accept the error silently
-  }
-
-  $self->{DBH}->do("update $table set ____stamp____ = now()") || return 'Failed to initialize stamp: ' . $self->{DBH}->errstr;
-
-  my $trigger = $id.'_ver_ins';
-  $self->{DBH}->do("create trigger $trigger before insert on $table for each row execute procedure sync_insert_ver()") || return 'Failed to create trigger: ' . $self->{DBH}->errstr;
-
-  $trigger = $id.'_ver_upd';
-  $self->{DBH}->do("create trigger $trigger before update on $table for each row execute procedure sync_update_ver()") || return 'Failed to create trigger: ' . $self->{DBH}->errstr;
-
-  $trigger = $id.'_del_row';
-  $self->{DBH}->do("create trigger $trigger after delete on $table for each row execute procedure sync_delete_row()") || return 'Failed to create trigger: ' .
$self->{DBH}->errstr;
- }
-
- $self->{DBH}->do("insert into ____publications____ (pubname, tablename) values ('$pub','$table')") || return 'Failed to create publication entry: '.$self->{DBH}->errstr;
-
- return undef;
-}
-
-
-# Removes a publication from the system.  Also drops triggers, sequences, etc. associated with the table if appropriate.
- # accepts one argument: the name of a publication
- # returns undef if ok, else error string
-sub unpublish {
- my $self = shift;
- my $pub = shift || return 'You must provide a publication name';
- my $qpub = $self->{DBH}->quote($pub);
- my $sql = "select tablename from ____publications____ where pubname = $qpub";
- my ($table) = $self->GetOneRow($sql);
- return 'Publication does not exist' if !defined($table);
-
- my $qtable = $self->{DBH}->quote($table);
-
- $sql = "select table_id, refcount from ____tables____ where tablename = $qtable";
- my ($id, $refcount) = $self->GetOneRow($sql);
- return "Table $table is not correctly registered!" if not defined($id);
-
- $self->{DBH}->do("update ____tables____ set refcount = refcount-1 where tablename = $qtable") || return 'Failed to decrement reference count: ' . $self->{DBH}->errstr;
-
- $self->{DBH}->do("delete from ____subscribed____ where pubname = $qpub") || return 'Failed to delete user subscriptions: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____subscribed_cols____ where pubname = $qpub") || return 'Failed to delete subscribed columns: ' . $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____publications____ where tablename = $qtable and pubname = $qpub") || return 'Failed to delete from publications: ' . $self->{DBH}->errstr;
-
- # if this is the last reference, we want to drop triggers, etc.
- if ($refcount <= 1) {
-  $id = '_'.$id.'_';
-
-  $self->{DBH}->do("alter table $table alter column ____rowver____ drop default") || return 'Failed to alter column default: ' .
$self->{DBH}->errstr;
-  $self->{DBH}->do("alter table $table alter column ____rowid____ drop default") || return 'Failed to alter column default: ' . $self->{DBH}->errstr;
-  $self->{DBH}->do("alter table $table alter column ____stamp____ drop default") || return 'Failed to alter column default: ' . $self->{DBH}->errstr;
-
-  my $trigger = $id.'_ver_upd';
-  $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to drop trigger: ' . $self->{DBH}->errstr;
-
-  $trigger = $id.'_ver_ins';
-  $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to drop trigger: ' . $self->{DBH}->errstr;
-
-  $trigger = $id.'_del_row';
-  $self->{DBH}->do("drop trigger $trigger on $table") || return 'Failed to drop trigger: ' . $self->{DBH}->errstr;
-
-  my $sequence = $id.'_rowid_seq';
-  $self->{DBH}->do("drop sequence $sequence") || return 'Failed to drop sequence: ' . $self->{DBH}->errstr;
-
-  my $index = $id.'____rowid____idx';
-  $self->{DBH}->do("drop index $index") || return 'Failed to drop index: ' . $self->{DBH}->errstr;
-  $self->{DBH}->do("delete from ____tables____ where tablename = $qtable") || return 'Failed to remove entry from tables: ' .
$self->{DBH}->errstr;
- }
- return undef;
-}
-
-
-#Subscribe user/node to a publication
- # Accepts 3 arguments: Username, Nodename, Publication
- # NOTE: the remaining arguments can be supplied as column names to which the user/node should be subscribed
- # Returns undef if ok, else returns an error string
-
-sub subscribe {
- my $self = shift;
- my $user = shift || die 'You must provide user, node and publication as arguments';
- my $node = shift || die 'You must provide user, node and publication as arguments';
- my $pub = shift || die 'You must provide user, node and publication as arguments';
- my @cols = @_;
-
- my $quser = $self->{DBH}->quote($user);
- my $qnode = $self->{DBH}->quote($node);
- my $qpub = $self->{DBH}->quote($pub);
-
- my $sql = "select tablename from ____publications____ where pubname = $qpub";
- my ($table) = $self->GetOneRow($sql);
- return "Publication $pub does not exist." if not defined $table;
- my $qtable = $self->{DBH}->quote($table);
-
- @cols = $self->GetTableCols($table) if !@cols; # get defaults if cols were not specified by caller
-
- $self->{DBH}->do("insert into ____subscribed____ (username, nodename, pubname, last_ver, refreshonce) values($quser, $qnode, $qpub, 0, true)") || return 'Failed to create subscription: ' . $self->{DBH}->errstr;
-
- foreach my $col (@cols) {
-  my $qcol = $self->{DBH}->quote($col);
-  $self->{DBH}->do("insert into ____subscribed_cols____ (username, nodename, pubname, col_name) values ($quser, $qnode, $qpub, $qcol)") || return 'Failed to subscribe column: ' .
$self->{DBH}->errstr;
- }
-
- return undef;
-}
-
-
-#Unsubscribe user/node from a publication
- # Accepts 3 arguments: Username, Nodename, Publication
- # Returns undef if ok, else returns an error string
-
-sub unsubscribe {
- my $self = shift;
- my $user = shift || die 'You must provide user, node and publication as arguments';
- my $node = shift || die 'You must provide user, node and publication as arguments';
- my $pub = shift || die 'You must provide user, node and publication as arguments';
-
- my $quser = $self->{DBH}->quote($user);
- my $qnode = $self->{DBH}->quote($node);
- my $qpub = $self->{DBH}->quote($pub);
-
- my $sql = "select tablename from ____publications____ where pubname = $qpub";
- my ($table) = $self->GetOneRow($sql);
- return "Publication $pub does not exist." if not defined $table;
-
- $self->{DBH}->do("delete from ____subscribed_cols____ where pubname = $qpub and username = $quser and nodename = $qnode") || return 'Failed to remove column subscription: '. $self->{DBH}->errstr;
- $self->{DBH}->do("delete from ____subscribed____ where pubname = $qpub and username = $quser and nodename = $qnode") || return 'Failed to remove subscription: '. $self->{DBH}->errstr;
-
- return undef;
-}
-
-
-#INSTALL creates the necessary management tables.
- # returns undef if everything is ok, else returns a string describing the error
-sub INSTALL {
-my $self = shift;
-
-#check to see if management tables are already installed
-
-my ($test) = $self->GetOneRow("select * from pg_class where relname = '____publications____'");
-if (defined($test)) {
- return 'It appears that synchronization management tables are already installed here.  Please uninstall before reinstalling.';
-};
-
-#install the management tables, etc.
-
-$self->{DBH}->do("create table ____publications____ (pubname text primary key, description text, tablename text, sync_order int4, whereclause text)") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____subscribed_cols____ (nodename text, username text, pubname text, col_name text, description text, primary key(nodename, username, pubname, col_name))") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____subscribed____ (nodename text, username text, pubname text, last_session timestamp, post_ver int4, last_ver int4, whereclause text, sanity_limit int4 default 0, sanity_delete int4 default 0, sanity_update int4 default 0, sanity_insert int4 default 50, readonly boolean, disabled boolean, fullrefreshonly boolean, refreshonce boolean, primary key(nodename, username, pubname))") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____last_stable____ (version int4, username text, nodename text, primary key(version, username, nodename))") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____tables____ (tablename text, table_id int4, refcount int4, primary key(tablename, table_id))") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create sequence ____table_id_seq____") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("alter table ____tables____ alter column table_id set default nextval('____table_id_seq____')") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____deleted____ (rowid int4, tablename text, rowver int4, stamp timestamp, primary key (rowid, tablename))") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____collision____ (rowid text, tablename text, rowver int4, stamp timestamp, faildate timestamp default now(), data text, reason text, action text, username text, nodename text, queue text)") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create sequence ____version_seq____")
|| return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create table ____sync_log____ (username text, nodename text, stamp timestamp, message text)") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create function sync_insert_ver() returns opaque as
-'begin
-if new.____rowver____ isnull then
-new.____rowver____ := ____version_seq____.last_value;
-end if;
-if new.____stamp____ isnull then
-new.____stamp____ := now();
-end if;
-return NEW;
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("create function sync_update_ver() returns opaque as
-'begin
-if new.____rowver____ = old.____rowver____ then
-new.____rowver____ := ____version_seq____.last_value;
-end if;
-if new.____stamp____ = old.____stamp____ then
-new.____stamp____ := now();
-end if;
-return NEW;
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-
-
-$self->{DBH}->do("create function sync_delete_row() returns opaque as
-'begin
-insert into ____deleted____ (rowid,tablename,rowver,stamp) values
-(old.____rowid____, TG_RELNAME, old.____rowver____,old.____stamp____);
-return old;
-end;' language 'plpgsql'") || return $self->{DBH}->errstr();
-
-return undef;
-}
-
-#removes all management tables & related stuff
- #returns undef if ok, else returns an error message as a string
-sub UNINSTALL {
-my $self = shift;
-
-#Make sure all tables are unpublished first
-my $sth = $self->{DBH}->prepare("select pubname from ____publications____");
-$sth->execute;
-my $pub;
-while (($pub) = $sth->fetchrow_array) {
- $self->unpublish($pub);
-}
-$sth->finish;
-
-$self->{DBH}->do("drop table ____publications____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____subscribed_cols____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____subscribed____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____last_stable____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop
table ____deleted____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____collision____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____tables____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop table ____sync_log____") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("drop sequence ____table_id_seq____") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop sequence ____version_seq____") || return $self->{DBH}->errstr();
-
-$self->{DBH}->do("drop function sync_insert_ver()") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop function sync_update_ver()") || return $self->{DBH}->errstr();
-$self->{DBH}->do("drop function sync_delete_row()") || return $self->{DBH}->errstr();
-
-return undef;
-
-}
-
-sub DESTROY {
- my $self = shift;
-
- $self->{DBH}->disconnect;
- $self->{DBLOG}->disconnect;
- return undef;
-}
-
-############# Helper Subs ############
-
-sub GetOneRow {
- my $self = shift;
- my $sql = shift || die 'Must provide sql select statement';
- my $sth = $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row = $sth->fetchrow_array;
- $sth->finish;
- return @row;
-}
-
-#call this with second non-zero value to get hidden columns
-sub GetTableCols {
- my $self = shift;
- my $table = shift || die 'Must provide table name';
- my $wanthidden = shift;
- my $sql = "select * from $table where 0 = 1";
- my $sth = $self->{DBH}->prepare($sql) || return undef;
- $sth->execute || return undef;
- my @row = @{$sth->{NAME}};
- $sth->finish;
- return @row if $wanthidden;
- my @cols;
- foreach my $col (@row) {
- next if $col eq '____rowver____';
- next if $col eq '____stamp____';
- next if $col eq '____rowid____';
- push @cols, $col;
- }
- return @cols;
-}
-
-
-1; #happy require
-
-------=_NextPart_000_0062_01C0541E.125CAF30--
-
-
-From pgsql-hackers-owner+M9917@postgresql.org Mon Jun 11 15:53:25 2001
-Return-path:
-Received:
from postgresql.org (webmail.postgresql.org [216.126.85.28])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BJrPL01206
- for ; Mon, 11 Jun 2001 15:53:25 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5BJrPE67753;
- Mon, 11 Jun 2001 15:53:25 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9917@postgresql.org)
-Received: from mail.greatbridge.com (mail.greatbridge.com [65.196.68.36])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BJmLE65620
- for ; Mon, 11 Jun 2001 15:48:21 -0400 (EDT)
- (envelope-from djohnson@greatbridge.com)
-Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [65.196.69.70])
- by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5BJm2Q28847
- for ; Mon, 11 Jun 2001 15:48:02 -0400
-From: Darren Johnson
-Date: Mon, 11 Jun 2001 19:46:44 GMT
-Message-ID: <20010611.19464400@j2.us.greatbridge.com>
-Subject: [HACKERS] Postgres Replication
-To: pgsql-hackers@postgresql.org
-Reply-To: Darren Johnson
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5BJmLE65621
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-We have been researching replication for several months now, and
-I have some opinions to share with the community for feedback,
-discussion, and/or participation. Our goal is to get a replication
-solution for PostgreSQL that will meet most needs of users
-and applications alike (mission impossible theme here :).
-
-My research work, along with other contributors', has been collected
-and presented here http://www.greatbridge.org/genpage?replication_top
-If there is something missing, especially PostgreSQL-related
-work, I would like to know about it, and my apologies to anyone
-who got left off the list.
This work is ongoing and doesn't
-draw a conclusion, which IMHO should be left up to the user,
-but I'm offering my opinions to spur discussion and/or feedback
-from this list, and try not to offend anyone.
-
-Here's my opinion: of the approaches we've surveyed, the most
-promising one is the Postgres-R project from the Information and
-Communication Systems Group, ETH in Zurich, Switzerland, originally
-produced by Bettina Kemme, Gustavo Alonso, and others. Although
-Postgres-R is a synchronous approach, I believe it is the closest to
-the goal mentioned above. Here is a summary of the advantages.
-
-1) Postgres-R is built on the PostgreSQL-6.4.2 code base. The
-replication functionality is an optional parameter, so there will be
-insignificant overhead for non-replication situations. The replication
-and communication managers are the two new modules added to the
-PostgreSQL code base.
-
-2) The replication manager's main function is controlling the
-replication protocol via a message-handling process. It receives
-messages from the local and remote backends and forwards write
-sets and decision messages via the communication manager to the
-other servers. The replication manager controls all the transactions
-running on the local server by keeping track of their states, including
-which protocol phase (read, send, lock, or write) the transaction is
-in. The replication manager maintains a two-way channel,
-implemented as buffered sockets, to each backend.
-
-3) The main task of the communication manager is to provide a simple
-socket-based interface between the replication manager and the
-group communication system (currently Ensemble). The
-communication system is a cluster of servers connected via
-the communication manager.
The replication manager also maintains
-three one-way channels to the communication system: a broadcast
-channel to send messages, a total-order channel to receive
-totally ordered write sets, and a no-order channel to listen for
-decision messages from the communication system. Decision
-messages can be received at any time, whereas the reception of
-totally ordered write sets can be blocked in certain phases.
-
-4) Based on a two-phase locking approach, all deadlock situations
-are local, detectable by the Postgres-R code base, and aborted.
-
-5) The write set messages used to send database changes to other
-servers can use either the SQL statements or the actual tuples
-changed. This is a parameter based on the number of tuples changed
-by a transaction. While sending the tuple changes reduces
-overhead in query parsing, planning, and execution, there is a negative
-effect in sending a large write set across the network.
-
-6) Postgres-R uses a synchronous approach that keeps the data on
-all sites consistent and provides serializability. The user does not
-have to bother with conflict resolution, and receives the same
-correctness and consistency as a centralized system.
-
-7) Postgres-R could be part of a good fault-resilient and
-load-distribution solution. It is peer-to-peer based and incurs low
-overhead propagating updates to the other cluster members. All
-replicated databases process queries locally.
-
-8) Compared to other synchronous replication strategies (e.g., standard
-distributed 2-phase-locking + 2-phase-commit), Postgres-R has much
-better performance using 2-phase-locking.
-
-
-There are some issues that are not currently addressed by
-Postgres-R, but some enhancements made to PostgreSQL since the
-6.4.2 tree are very favorable to addressing these shortcomings.
-
-1) The addition of WAL in 7.1 has the information needed for recovering
-failed/off-line servers; currently all the servers would have to be
-stopped, and a copy would be used to get all the servers synchronized
-before starting again.
-
-2) Being synchronous, Postgres-R would not be a good solution
-for off-line/WAN scenarios where asynchronous replication is
-required. There are some theories on this issue which involve servers
-connecting and disconnecting from the cluster.
-
-3) As in any serialized synchronous approach, there is a change in the
-flow of execution of a transaction; while most of these changes can
-be solved by calling newly developed functions at certain time points,
-synchronous replica control is tightly coupled with the concurrency
-control. Hence, especially in PostgreSQL 7.2, some parts of the
-concurrency control (MVCC) might have to be adjusted. This can lead to
-slightly more complicated maintenance than a system that does not
-change the backend.
-
-4) Partial replication is not addressed.
-
-
-Any feedback on this post will be appreciated.
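[Editorial illustration: the total-order delivery described in points 2 and 3 above — every replica buffers out-of-order write sets and commits them in the globally agreed sequence — can be sketched roughly as follows. This is a hypothetical Python sketch for illustration only; it is not Postgres-R code, and all names are invented.]

```python
import heapq

class WriteSetApplier:
    """Buffers out-of-order write sets and applies them strictly in the
    total order assigned by the group communication system, so every
    replica commits the same transactions in the same sequence."""

    def __init__(self):
        self.next_seq = 0   # next global sequence number to apply
        self.pending = []   # min-heap of (seq, write_set) not yet due
        self.applied = []   # write sets applied, in commit order

    def deliver(self, seq, write_set):
        # The network may deliver messages out of order; park them
        # until every earlier write set has been applied.
        heapq.heappush(self.pending, (seq, write_set))
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ws = heapq.heappop(self.pending)
            self.applied.append(ws)  # stand-in for "execute and commit"
            self.next_seq += 1

# Sequence number 1 arrives before 0: it is buffered, then both
# are applied in total order once 0 shows up.
applier = WriteSetApplier()
applier.deliver(1, "UPDATE t SET x = 2")
applier.deliver(0, "INSERT INTO t VALUES (1)")
```

Because every replica runs the same deterministic apply loop over the same totally ordered stream, all replicas end up committing in the same order, which is the property point 6 relies on.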
- -Thanks, - -Darren - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M9923@postgresql.org Mon Jun 11 18:14:23 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BMENL18644 - for ; Mon, 11 Jun 2001 18:14:23 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5BMEQE14877; - Mon, 11 Jun 2001 18:14:26 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9923@postgresql.org) -Received: from spoetnik.xs4all.nl (spoetnik.xs4all.nl [194.109.249.226]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BM6ME12270 - for ; Mon, 11 Jun 2001 18:06:23 -0400 (EDT) - (envelope-from reinoud@xs4all.nl) -Received: from KAYAK (kayak [192.168.1.20]) - by spoetnik.xs4all.nl (Postfix) with SMTP id 865A33E1B - for ; Tue, 12 Jun 2001 00:06:16 +0200 (CEST) -From: reinoud@xs4all.nl (Reinoud van Leeuwen) -To: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Postgres Replication -Date: Mon, 11 Jun 2001 22:06:07 GMT -Organization: Not organized in any way -Reply-To: reinoud@xs4all.nl -Message-ID: <3b403d96.562404297@192.168.1.10> -References: <20010611.19464400@j2.us.greatbridge.com> -In-Reply-To: <20010611.19464400@j2.us.greatbridge.com> -X-Mailer: Forte Agent 1.5/32.451 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5BM6PE12276 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Mon, 11 Jun 2001 19:46:44 GMT, you wrote: - ->We have been researching replication for several months now, and ->I have some opinions to share to the community for feedback, ->discussion, and/or 
participation. Our goal is to get a replication ->solution for PostgreSQL that will meet most needs of users ->and applications alike (mission impossible theme here :). -> ->My research work along with others contributors has been collected ->and presented here http://www.greatbridge.org/genpage?replication_top ->If there is something missing, especially PostgreSQL related ->work, I would like to know about it, and my apologies to any ->one who got left off the list. This work is ongoing and doesn't ->draw a conclusion, which IMHO should be left up to the user, ->but I'm offering my opinions to spur discussion and/or feed back ->from this list, and try not to offend any one. -> ->Here's my opinion: of the approaches we've surveyed, the most ->promising one is the Postgres-R project from the Information and ->Communication Systems Group, ETH in Zurich, Switzerland, originally ->produced by Bettina Kemme, Gustavo Alonso, and others. Although ->Postgres-R is a synchronous approach, I believe it is the closest to ->the goal mentioned above. Here is an abstract of the advantages. -> ->1) Postgres-R is built on the PostgreSQL-6.4.2 code base. The ->replication ->functionality is an optional parameter, so there will be insignificant ->overhead for non replication situations. The replication and ->communication ->managers are the two new modules added to the PostgreSQL code base. -> ->2) The replication manager's main function is controlling the ->replication protocol via a message handling process. It receives ->messages from the local and remote backends and forwards write ->sets and decision messages via the communication manager to the ->other servers. The replication manager controls all the transactions ->running on the local server by keeping track of the states, including ->which protocol phase (read, send, lock, or write) the transaction is ->in. The replication manager maintains a two way channel ->implemented as buffered sockets to each backend. 
- -what does "manager controls all the transactions" mean? I hope it does -*not* mean that a bug in the manager would cause transactions not to -commit... - -> ->3) The main task of the communication manager is to provide simple ->socket based interface between the replication manager and the ->group communication system (currently Ensemble). The ->communication system is a cluster of servers connected via ->the communication manager. The replication manager also maintains ->three one-way channels to the communication system: a broadcast ->channel to send messages, a total-order channel to receive ->totally orders write sets, and a no-order channel to listen for ->decision messages from the communication system. Decision ->messages can be received at any time where the reception of ->totally ordered write sets can be blocked in certain phases. -> ->4) Based on a two phase locking approach, all dead lock situations ->are local and detectable by Postgres-R code base, and aborted. - -Does this imply locking over different servers? That would mean a -grinding halt when a network outage occurs... - ->5) The write set messages used to send database changes to other ->servers, can use either the SQL statements or the actual tuples ->changed. This is a parameter based on number of tuples changed ->by a transaction. While sending the tuple changes reduces ->overhead in query parse, plan and execution, there is a negative ->effect in sending a large write set across the network. -> ->6) Postgres-R uses a synchronous approach that keeps the data on ->all sites consistent and provides serializability. The user does not ->have to bother with conflict resolution, and receives the same ->correctness and consistency of a centralized system. -> ->7) Postgres-R could be part of a good fault-resilient and load ->distribution ->solution. It is peer-to-peer based and incurs low overhead propagating ->updates to the other cluster members. 
All replicated databases locally
->process queries.
->
->8) Compared to other synchronous replication strategies (e.g., standard
->distributed 2-phase-locking + 2-phase-commit), Postgres-R has much
->better performance using 2-phase-locking.
-
-Coming from a Sybase background I have some experience with
-replication. The way it works in Sybase Replication Server is as
-follows:
-- for each replicated database, there is a "log reader" process that
-reads the WAL and forwards only *committed transactions* to the
-replication server. (it does not make much sense to replicate other
-things IMHO :-).
-- the replication server stores incoming data in a queue ("stable
-device") until it is sure it has reached its final destination.
-
-- a replication server can send data to another replication server in
-a compact (read: WAN-friendly) way. A chain of replication servers can
-be made, depending on network architecture.
-
-- the final replication server makes an almost standard client
-connection to the target database and translates the compact
-transactions back to SQL statements. By using masks, extra
-functionality can be built in.
-
-This kind of architecture has several advantages:
-- only committed transactions are replicated, which saves overhead
-- it does not have very much impact on performance of the source
-server (apart from reading the WAL)
-- since every replication server has a stable device, data is stored
-when the network is down and nothing gets lost (nor stops performing)
-- because only the log reader and the connection from the final
-replication server are RDBMS-specific, it is possible to replicate
-from MS to Oracle using a Sybase replication server (or different
-versions etc).
-
-I do not know how much of this is patented or copyrighted, but the
-architecture seems elegant and robust to me. I have done
-implementations of bi-directional replication too. It *is* possible
-but does require some funky setup and maintenance.
(but it is better -that letting offices on different continents working on the same -database :-) - -just my 2 EURO cts :-) - - --- -__________________________________________________ -"Nothing is as subjective as reality" -Reinoud van Leeuwen reinoud@xs4all.nl -http://www.xs4all.nl/~reinoud -__________________________________________________ - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M9924@postgresql.org Mon Jun 11 18:41:51 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5BMfpL28917 - for ; Mon, 11 Jun 2001 18:41:51 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5BMfsE25092; - Mon, 11 Jun 2001 18:41:54 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9924@postgresql.org) -Received: from spider.pilosoft.com (p55-222.acedsl.com [160.79.55.222]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5BMalE23024 - for ; Mon, 11 Jun 2001 18:36:47 -0400 (EDT) - (envelope-from alex@pilosoft.com) -Received: from localhost (alexmail@localhost) - by spider.pilosoft.com (8.9.3/8.9.3) with ESMTP id SAA06092; - Mon, 11 Jun 2001 18:46:05 -0400 (EDT) -Date: Mon, 11 Jun 2001 18:46:05 -0400 (EDT) -From: Alex Pilosov -To: Reinoud van Leeuwen -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Postgres Replication -In-Reply-To: <3b403d96.562404297@192.168.1.10> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Mon, 11 Jun 2001, Reinoud van Leeuwen wrote: - -> On Mon, 11 Jun 2001 19:46:44 GMT, you wrote: - -> what does "manager controls all the transactions" mean? 
I hope it does -> *not* mean that a bug in the manager would cause transactions not to -> commit... -Well yeah it does. Bugs are a fact of life. :) - -> >4) Based on a two phase locking approach, all dead lock situations -> >are local and detectable by Postgres-R code base, and aborted. -> -> Does this imply locking over different servers? That would mean a -> grinding halt when a network outage occurs... -Don't know, but see below. - -> Coming from a Sybase background I have some experience with -> replication. The way it works in Sybase Replication server is as -> follows: -> - for each replicated database, there is a "log reader" process that -> reads the WAL and captures only *committed transactions* to the -> replication server. (it does not make much sense to replicate other -> things IMHO :-). -> - the replication server stores incoming data in a que ("stable -> device"), until it is sure it has reached its final destination -> -> - a replication server can send data to another replication server in -> a compact (read: WAN friendly) way. A chain of replication servers can -> be made, depending on network architecture) -> -> - the final replication server makes a almost standard client -> connection to the target database and translates the compact -> transactions back to SQL statements. By using masks, extra -> functionality can be built in. -> -> This kind of architecture has several advantages: -> - only committed transactions are replicated which saves overhead -> - it does not have very much impact on performance of the source -> server (apart from reading the WAL) -> - since every replication server has a stable device, data is stored -> when the network is down and nothing gets lost (nor stops performing) -> - because only the log reader and the connection from the final -> replication server are RDBMS specific, it is possible to replicate -> from MS to Oracle using a Sybase replication server (or different -> versions etc). 
->
-> I do not know how much of this is patented or copyrighted, but the
-> architecture seems elegant and robust to me. I have done
-> implementations of bi-directional replication too. It *is* possible
-> but does require some funky setup and maintenance. (but it is better
-> that letting offices on different continents working on the same
-> database :-)
-Yes, the above architecture is what almost every vendor of replication
-software uses. And I'm sure if you worked much with Sybase, you hate the
-garbage that their repserver is :).
-
-The architectures of postgres-r and repserver are fundamentally different
-for a good reason: repserver only wants to replicate committed
-transactions, while postgres-r is more of a 'clustering' solution (albeit
-they don't say this word), and is capable of much more than a simple
-rep server.
-
-I.e., you can safely put half of your clients on the second server in a
-replicated postgres-r cluster without being worried that a conflict (or a
-weird locking situation) may occur.
-
-Try that with Sybase: it is fundamentally designed for one-way
-replication, and the fact that you can do one-way replication in both
-directions doesn't mean it's safe to do that!
-
-I'm not sure how postgres-r handles network problems. To be useful, a good
-replication solution must have an option of "no network->no updates" as
-well as "no network->queue updates and send them later". However, it is
-far easier to add queuing to a correct 'eager locking' database than it is
-to add proper locking to a queue-based replicator.
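[Editorial illustration: the store-and-forward "stable device" queue discussed above — committed transactions are persisted locally and only discarded once the destination acknowledges them — can be sketched minimally as follows. This is a hypothetical Python sketch; the file format and names are invented and are not Sybase Replication Server's or Postgres-R's.]

```python
import json
import os
import tempfile

class StableQueue:
    """A minimal store-and-forward queue: committed transactions are
    persisted to a local file and only discarded after the downstream
    replica acknowledges them, so nothing is lost while the network
    or the destination is down."""

    def __init__(self, path):
        self.path = path
        open(self.path, "a").close()  # ensure the queue file exists

    def enqueue(self, txn):
        # Durably record the committed transaction before forwarding it.
        with open(self.path, "a") as f:
            f.write(json.dumps(txn) + "\n")

    def pending(self):
        # Everything not yet acknowledged by the destination.
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def ack(self, n):
        # The destination confirmed the first n transactions; drop them.
        rest = self.pending()[n:]
        with open(self.path, "w") as f:
            for txn in rest:
                f.write(json.dumps(txn) + "\n")

# Queue two committed transactions; the destination acknowledges one.
queue = StableQueue(os.path.join(tempfile.mkdtemp(), "stable.log"))
queue.enqueue({"sql": "INSERT INTO accounts VALUES (1, 100)"})
queue.enqueue({"sql": "UPDATE accounts SET bal = 90 WHERE id = 1"})
queue.ack(1)
```

The durable local file is what lets this style of replicator keep accepting commits through a network outage and drain the backlog later, which is exactly the "no network -> queue updates and send them later" option described above.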
- --alex - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M9932@postgresql.org Mon Jun 11 22:17:54 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C2HsL15803 - for ; Mon, 11 Jun 2001 22:17:54 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5C2HtE86836; - Mon, 11 Jun 2001 22:17:55 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9932@postgresql.org) -Received: from femail15.sdc1.sfba.home.com (femail15.sdc1.sfba.home.com [24.0.95.142]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C2BXE85020 - for ; Mon, 11 Jun 2001 22:11:33 -0400 (EDT) - (envelope-from djohnson@greatbridge.com) -Received: from greatbridge.com ([65.2.95.27]) - by femail15.sdc1.sfba.home.com - (InterMail vM.4.01.03.20 201-229-121-120-20010223) with ESMTP - id <20010612021124.OZRG17243.femail15.sdc1.sfba.home.com@greatbridge.com>; - Mon, 11 Jun 2001 19:11:24 -0700 -Message-ID: <3B257969.6050405@greatbridge.com> -Date: Mon, 11 Jun 2001 22:07:37 -0400 -From: Darren Johnson -User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; m18) Gecko/20001108 Netscape6/6.0 -X-Accept-Language: en -MIME-Version: 1.0 -To: Alex Pilosov , Reinoud van Leeuwen -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Postgres Replication -References: -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -Thanks for the feedback. I'll try to address both your issues here. - ->> what does "manager controls all the transactions" mean? 
->
-The replication manager controls the transactions by serializing the
-write set messages. This ensures all transactions are committed in the
-same order on each server, so bugs here are not allowed ;-)
-
->> I hope it does
->> *not* mean that a bug in the manager would cause transactions not to
->> commit...
->
-> Well yeah it does. Bugs are a fact of life. :)
-
->
->>> 4) Based on a two phase locking approach, all dead lock situations
->>> are local and detectable by Postgres-R code base, and aborted.
->>
->> Does this imply locking over different servers? That would mean a
->> grinding halt when a network outage occurs...
->
-> Don't know, but see below.
-
-There is a branch of the Postgres-R code that has some failure
-detection implemented, so we will have to merge this functionality with
-the version of Postgres-R we have, and test this issue. I'll let you
-know the results.
-
->>
->> - the replication server stores incoming data in a que ("stable
->> device"), until it is sure it has reached its final destination
->
-I like this idea for recovering servers that have been down a short
-period of time, using WAL to recover transactions missed during the
-outage.
-
->>
->> This kind of architecture has several advantages:
->> - only committed transactions are replicated which saves overhead
->> - it does not have very much impact on performance of the source
->> server (apart from reading the WAL)
->> - since every replication server has a stable device, data is stored
->> when the network is down and nothing gets lost (nor stops performing)
->> - because only the log reader and the connection from the final
->> replication server are RDBMS specific, it is possible to replicate
->> from MS to Oracle using a Sybase replication server (or different
->> versions etc).
->
-There are some issues with the "log reader" approach:
-1) The databases are not synchronized until the log reader completes
-its processing.
-2) I'm not sure about Sybase, but the log reader sends SQL statements
-to the other servers, which are then parsed, planned and executed.
-This overhead could be avoided if only the tuple changes are
-replicated.
-3) Works fine for read-only situations, but peer-to-peer applications
-using this approach must be designed with a conflict resolution
-scheme.
-
-Don't get me wrong, I believe we can learn from the replication
-techniques used by commercial databases like Sybase, and try to
-implement the good ones in PostgreSQL. Postgres-R is a synchronous
-approach which outperforms the traditional approaches to synchronous
-replication. Being based on PostgreSQL-6.4.2, getting this approach
-into the 7.2 tree might be better than reinventing the wheel.
-
-Thanks again,
-
-Darren
-
-
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-
-http://www.postgresql.org/search.mpl
-
-From pgsql-hackers-owner+M9936@postgresql.org Tue Jun 12 03:22:51 2001
-Return-path:
-Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C7MoL11061
- for ; Tue, 12 Jun 2001 03:22:50 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5C7MPE35441;
- Tue, 12 Jun 2001 03:22:25 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9936@postgresql.org)
-Received: from reorxrsm.server.lan.at (zep3.it-austria.net [213.150.1.73])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C72ZE25009
- for ; Tue, 12 Jun 2001 03:02:36 -0400 (EDT)
- (envelope-from ZeugswetterA@wien.spardat.at)
-Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149])
- by reorxrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5C72Qu27966
- for ; Tue, 12 Jun 2001 09:02:26 +0200
-Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service
(5.5.2650.21)
- id ; Tue, 12 Jun 2001 09:02:21 +0200
-Message-ID: <11C1E6749A55D411A9670001FA68796336831B@sdexcsrv1.f000.d0188.sd.spardat.at>
-From: Zeugswetter Andreas SB
-To: "'Darren Johnson'" ,
- pgsql-hackers@postgresql.org
-Subject: AW: [HACKERS] Postgres Replication
-Date: Tue, 12 Jun 2001 09:02:20 +0200
-MIME-Version: 1.0
-X-Mailer: Internet Mail Service (5.5.2650.21)
-Content-Type: text/plain;
- charset="iso-8859-1"
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-
-> Although
-> Postgres-R is a synchronous approach, I believe it is the closest to
-> the goal mentioned above. Here is an abstract of the advantages.
-
-If you only want synchronous replication, why not simply use triggers?
-All you would then need is remote query access and two-phase commit,
-and maybe a little script that helps create the appropriate triggers.
-
-Doing a replicate-all-or-nothing approach that only works synchronously
-is imho not flexible enough.
-
-Andreas
-
----------------------------(end of broadcast)---------------------------
-TIP 6: Have you searched our list archives?
-
-http://www.postgresql.org/search.mpl
-
-From pgsql-hackers-owner+M9945@postgresql.org Tue Jun 12 10:18:29 2001
-Return-path:
-Received: from postgresql.org (webmail.postgresql.org [216.126.85.28])
- by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CEISL06372
- for ; Tue, 12 Jun 2001 10:18:28 -0400 (EDT)
-Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28])
- by postgresql.org (8.11.3/8.11.1) with SMTP id f5CEIQE77517;
- Tue, 12 Jun 2001 10:18:26 -0400 (EDT)
- (envelope-from pgsql-hackers-owner+M9945@postgresql.org)
-Received: from krypton.netropolis.org ([208.222.215.99])
- by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CEDuE75514
- for ; Tue, 12 Jun 2001 10:13:56 -0400 (EDT)
- (envelope-from root@generalogic.com)
-Received: from [132.216.183.103] (helo=localhost)
- by krypton.netropolis.org with esmtp (Exim 3.12 #1 (Debian))
- id 159ouq-0003MU-00
- for ; Tue, 12 Jun 2001 10:13:08 -0400
-To: pgsql-hackers@postgresql.org
-Subject: Re: AW: [HACKERS] Postgres Replication
-In-Reply-To: <20010612.13321600@j2.us.greatbridge.com>
-References:
- <20010612.13321600@j2.us.greatbridge.com>
-X-Mailer: Mew version 1.94.2 on Emacs 20.7 / Mule 4.0 (HANANOEN)
-MIME-Version: 1.0
-Content-Type: Text/Plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Message-ID: <20010612123623O.root@generalogic.com>
-Date: Tue, 12 Jun 2001 12:36:23 +0530
-From: root
-X-Dispatcher: imput version 20000414(IM141)
-Lines: 47
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-
-Hello
-
-I have hacked up a replication layer for Perl code accessing a
-database through the DBI interface. It works pretty well with MySQL
-(I can run pre-bender slashcode replicated; haven't tried the more
-recent releases).
-
-Potentially this hack should also work with Pg, but I haven't tried it
-yet. If someone would like to test it out with a complex Pg app and
-let me know how it went, that would be cool.
- -The replication layer is based on Eric Newton's Recall replication -library (www.fault-tolerant.org/recall), and requires that all -database accesses be through the DBI interface. - -The replicas are live, in that every operation affects all the -replicas in real time. Replica outages are invisible to the user, so -long as a majority of the replicas are functioning. Disconnected -replicas can be used for read-only access. - -The only code modification that should be required to use the -replication layer is to change the DSN in connect(): - - my $replicas = '192.168.1.1:7000,192.168.1.2:7000,192.168.1.3:7000'; - my $dbh = DBI->connect("DBI:Recall:database=$replicas"); - -You should be able to install the replication modules with: - -perl -MCPAN -eshell -cpan> install Replication::Recall::DBServer - -and then install DBD::Recall (which doesn't seem to be accessible from -the CPAN shell yet, for some reason), by: - -wget http://www.cpan.org/authors/id/AGUL/DBD-Recall-1.10.tar.gz -tar xzvf DBD-Recall-1.10.tar.gz -cd DBD-Recall-1.10 -perl Makefile.PL -make install - -I would be very interested in hearing about your experiences with -this... - -Thanks - -#! 
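The majority behavior described above (writes succeed as long as a majority of the replicas are functioning, while disconnected replicas serve reads only) can be sketched in a few lines. This is a toy model under stated assumptions: the `Replica` class and its `apply` method are invented stand-ins, not Recall's actual API.

```python
# Sketch of majority-quorum replication, loosely following the behavior
# described for the Recall-based layer above. Replica/apply are invented
# stand-ins, not Recall's real API.

class Replica:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.log = []          # statements applied on this replica, in order

    def apply(self, stmt):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.log.append(stmt)

class QuorumWriter:
    def __init__(self, replicas):
        self.replicas = replicas

    def execute(self, stmt):
        acks = 0
        for r in self.replicas:
            try:
                r.apply(stmt)
                acks += 1
            except ConnectionError:
                pass           # a replica outage is invisible to the caller...
        if acks <= len(self.replicas) // 2:
            # ...unless a majority is gone: then the write fails
            raise RuntimeError("write failed: no majority of replicas reachable")
        return acks

replicas = [Replica("A"), Replica("B"), Replica("C", up=False)]
writer = QuorumWriter(replicas)
print(writer.execute("INSERT INTO t VALUES (1)"))  # 2 acks: majority, succeeds
```

With one of three replicas down the write still succeeds; with two down it raises, matching the "majority of the replicas" condition in the mail.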
- ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M9938@postgresql.org Tue Jun 12 05:12:54 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5C9CrL15228 - for ; Tue, 12 Jun 2001 05:12:53 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5C9CnE91297; - Tue, 12 Jun 2001 05:12:49 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9938@postgresql.org) -Received: from mobile.hub.org (SHW39-29.accesscable.net [24.138.39.29]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5C98DE89175 - for ; Tue, 12 Jun 2001 05:08:13 -0400 (EDT) - (envelope-from scrappy@hub.org) -Received: from localhost (scrappy@localhost) - by mobile.hub.org (8.11.3/8.11.1) with ESMTP id f5C97f361630; - Tue, 12 Jun 2001 06:07:46 -0300 (ADT) - (envelope-from scrappy@hub.org) -X-Authentication-Warning: mobile.hub.org: scrappy owned process doing -bs -Date: Tue, 12 Jun 2001 06:07:41 -0300 (ADT) -From: The Hermit Hacker -To: Zeugswetter Andreas SB -cc: "'Darren Johnson'" , - -Subject: Re: AW: [HACKERS] Postgres Replication -In-Reply-To: <11C1E6749A55D411A9670001FA68796336831B@sdexcsrv1.f000.d0188.sd.spardat.at> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -which I believe is what the rserv implementation in contrib currently does -... no? - -its funny ... what is in contrib right now was developed in a weekend by -Vadim, put in contrib, yet nobody has either used it *or* seen fit to -submit patches to improve it ... ? 
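The ingredient the trigger-based recipe leans on is two-phase commit: prepare on every node first, and commit only if every node votes yes. A minimal sketch, assuming an invented `Participant` interface rather than any real PostgreSQL API of the time:

```python
# Minimal two-phase commit sketch. The Participant interface is invented
# for illustration; it is not a PostgreSQL API.

class Participant:
    def __init__(self, name, will_fail=False):
        self.name = name
        self.will_fail = will_fail
        self.state = "idle"

    def prepare(self, txn):
        # Phase 1: persist the transaction so a later COMMIT cannot fail.
        if self.will_fail:
            self.state = "aborted"
            return False
        self.state = "prepared"
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1: ask every participant to prepare (vote).
    if all(p.prepare(txn) for p in participants):
        for p in participants:      # Phase 2: unanimous yes -> commit everywhere
            p.commit()
        return True
    for p in participants:          # any no vote -> roll back everywhere
        p.rollback()
    return False
```

Either every server ends up committed or every server ends up aborted, which is exactly the synchronous all-or-nothing property being debated in this thread.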
- -On Tue, 12 Jun 2001, Zeugswetter Andreas SB wrote: - -> -> > Although -> > Postgres-R is a synchronous approach, I believe it is the closest to -> > the goal mentioned above. Here is an abstract of the advantages. -> -> If you only want synchronous replication, why not simply use triggers ? -> All you would then need is remote query access and two phase commit, -> and maybe a little script that helps create the appropriate triggers. -> -> Doing a replicate all or nothing approach that only works synchronous -> is imho not flexible enough. -> -> Andreas -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 6: Have you searched our list archives? -> -> http://www.postgresql.org/search.mpl -> - -Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy -Systems Administrator @ hub.org -primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M9940@postgresql.org Tue Jun 12 09:39:08 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CDd8L03200 - for ; Tue, 12 Jun 2001 09:39:08 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CDcmE58175; - Tue, 12 Jun 2001 09:38:48 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9940@postgresql.org) -Received: from mail.greatbridge.com (mail.greatbridge.com [65.196.68.36]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CDYAE56164 - for ; Tue, 12 Jun 2001 09:34:10 -0400 (EDT) - (envelope-from djohnson@greatbridge.com) -Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [65.196.69.70]) - by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CDXeQ03585; - Tue, 12 Jun 2001 09:33:40 -0400 -From: Darren Johnson -Date: Tue, 12 
Jun 2001 13:32:16 GMT -Message-ID: <20010612.13321600@j2.us.greatbridge.com> -Subject: Re: AW: [HACKERS] Postgres Replication -To: The Hermit Hacker -cc: Zeugswetter Andreas SB , - -Reply-To: Darren Johnson -In-Reply-To: -References: -X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux) -X-Priority: 3 (Normal) -MIME-Version: 1.0 -Content-Type: text/plain; charset=ISO-8859-1 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CDYAE56166 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -> which I believe is what the rserv implementation in contrib currently -does -> ... no? - -We tried rserv, PG Link (Joseph Conway), and PostgreSQL Replicator. All -these projects are trigger-based asynchronous replication. They all have -some advantages over the current functionality of Postgres-R, some of -which I believe can be addressed: - -1) Partial replication - being able to replicate just one table, or part -of one -2) They make no changes to the PostgreSQL code base. (Postgres-R can't -address this one ;) -3) PostgreSQL Replicator has some very nice conflict resolution schemes. - - -Here are some disadvantages to using a "trigger based" approach: - -1) Triggers simply transfer individual data items when they are modified, -they do not keep track of transactions. -2) The execution of triggers within a database imposes a performance -overhead to that database. -3) Triggers require careful management by database administrators. -Someone needs to keep track of all the "alarms" going off. -4) The activation of triggers in a database cannot be easily -rolled back or undone. - - - -> On Tue, 12 Jun 2001, Zeugswetter Andreas SB wrote: - -> > Doing a replicate all or nothing approach that only works synchronous -> > is imho not flexible enough. -> > - - -I agree.
Partial and asynchronous replication need to be addressed, -and some of the common functionality of Postgres-R could possibly -be used to meet those needs. - - -Thanks for your feedback, - -Darren - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M9969@postgresql.org Tue Jun 12 16:53:45 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CKriL23104 - for ; Tue, 12 Jun 2001 16:53:44 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CKrlE87423; - Tue, 12 Jun 2001 16:53:47 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9969@postgresql.org) -Received: from sectorbase2.sectorbase.com (sectorbase2.sectorbase.com [63.88.121.62] (may be forged)) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CHWkE69562 - for ; Tue, 12 Jun 2001 13:32:46 -0400 (EDT) - (envelope-from vmikheev@SECTORBASE.COM) -Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19) - id ; Tue, 12 Jun 2001 10:30:29 -0700 -Message-ID: <3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com> -From: "Mikheev, Vadim" -To: "'Darren Johnson'" , - The Hermit Hacker - -cc: Zeugswetter Andreas SB , - pgsql-hackers@postgresql.org -Subject: RE: AW: [HACKERS] Postgres Replication -Date: Tue, 12 Jun 2001 10:30:27 -0700 -MIME-Version: 1.0 -X-Mailer: Internet Mail Service (5.5.2653.19) -Content-Type: text/plain; - charset="iso-8859-1" -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -> Here are some disadvantages to using a "trigger based" approach: -> -> 1) Triggers simply transfer individual data items when they -> are modified, they do not keep track of transactions. 
- -I don't know about other *async* replication engines but Rserv -keeps track of transactions (if I understood you correctly). -Rserv transfers not individual modified data items but -a *consistent* snapshot of changes to move the slave database from -one *consistent* state (when all RI constraints are satisfied) -to another *consistent* state. - -> 4) The activation of triggers in a database cannot be easily -> rolled back or undone. - -What do you mean? - -Vadim - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M9967@postgresql.org Tue Jun 12 16:42:11 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CKgBL17982 - for ; Tue, 12 Jun 2001 16:42:11 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CKgDE80566; - Tue, 12 Jun 2001 16:42:13 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9967@postgresql.org) -Received: from mail.greatbridge.com (mail.greatbridge.com [65.196.68.36]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CIVdE07561 - for ; Tue, 12 Jun 2001 14:31:39 -0400 (EDT) - (envelope-from djohnson@greatbridge.com) -Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [65.196.69.70]) - by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CIUfQ10080; - Tue, 12 Jun 2001 14:30:41 -0400 -From: Darren Johnson -Date: Tue, 12 Jun 2001 18:29:20 GMT -Message-ID: <20010612.18292000@j2.us.greatbridge.com> -Subject: RE: AW: [HACKERS] Postgres Replication -To: "Mikheev, Vadim" -cc: The Hermit Hacker , - Zeugswetter Andreas SB - , - pgsql-hackers@postgresql.org -Reply-To: Darren Johnson - <3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com> -References:
<3705826352029646A3E91C53F7189E32016670@sectorbase2.sectorbase.com> -X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux) -X-Priority: 3 (Normal) -MIME-Version: 1.0 -Content-Type: text/plain; charset=ISO-8859-1 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CIVdE07562 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - - -> > Here are some disadvantages to using a "trigger based" approach: -> > -> > 1) Triggers simply transfer individual data items when they -> > are modified, they do not keep track of transactions. - -> I don't know about other *async* replication engines but Rserv -> keeps track of transactions (if I understood you corectly). -> Rserv transfers not individual modified data items but -> *consistent* snapshot of changes to move slave database from -> one *consistent* state (when all RI constraints satisfied) -> to another *consistent* state. - -I thought Andreas did a good job of correcting me here. Points 1 -and 4 do not apply to transaction-based replication with triggers. I -should have made a distinction between non-transaction and -transaction-based replication with triggers. I was not trying to -single out rserv or any other project, and I can see how my wording -invites this misinterpretation (my apologies). - - -> > 4) The activation of triggers in a database cannot be easily -> > rolled back or undone. - -> What do you mean? - -Once the trigger fires, it is not an easy task to abort that -execution via rollback or undo. Again this is not an issue -with a transaction-based trigger approach.
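The transaction-based variant conceded above, where trigger output is queued and shipped only at commit so that a rollback undoes it for free, can be sketched as a toy model (not how rserv or Postgres-R actually implement it):

```python
# Toy model of transaction-based trigger replication: row changes captured
# by triggers are queued per transaction and shipped only on COMMIT, so a
# ROLLBACK discards them. Invented for illustration; not rserv's design.

class ReplicatingSession:
    def __init__(self):
        self.shipped = []      # changes the slave has received
        self.pending = []      # changes captured in the open transaction

    def on_row_change(self, change):
        # A naive trigger would ship immediately; here we only queue.
        self.pending.append(change)

    def commit(self):
        self.shipped.extend(self.pending)   # ship the transaction as one unit
        self.pending = []

    def rollback(self):
        self.pending = []                   # nothing ever left the master

s = ReplicatingSession()
s.on_row_change(("INSERT", "fr", 1))
s.rollback()
print(s.shipped)        # [] -- aborted changes were never replicated
s.on_row_change(("INSERT", "fr", 2))
s.commit()
print(s.shipped)        # [('INSERT', 'fr', 2)]
```

Because nothing is sent before COMMIT, "undoing" a fired trigger is just dropping the pending queue, which is the distinction drawn in the mail above.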
- - -Sincerely, - -Darren - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M9943@postgresql.org Tue Jun 12 10:03:02 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CE32L04619 - for ; Tue, 12 Jun 2001 10:03:02 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CE31E70430; - Tue, 12 Jun 2001 10:03:01 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9943@postgresql.org) -Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CDoQE64062 - for ; Tue, 12 Jun 2001 09:50:26 -0400 (EDT) - (envelope-from ZeugswetterA@wien.spardat.at) -Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149]) - by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5CDoJe11224 - for ; Tue, 12 Jun 2001 15:50:19 +0200 -Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21) - id ; Tue, 12 Jun 2001 15:50:15 +0200 -Message-ID: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> -From: Zeugswetter Andreas SB -To: "'Darren Johnson'" , - The Hermit Hacker - -cc: pgsql-hackers@postgresql.org -Subject: AW: AW: [HACKERS] Postgres Replication -Date: Tue, 12 Jun 2001 15:50:09 +0200 -MIME-Version: 1.0 -X-Mailer: Internet Mail Service (5.5.2650.21) -Content-Type: text/plain; - charset="iso-8859-1" -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -> Here are some disadvantages to using a "trigger based" approach: -> -> 1) Triggers simply transfer individual data items when they -> are modified, they do not keep track of transactions. 
-> 2) The execution of triggers within a database imposes a performance -> overhead to that database. -> 3) Triggers require careful management by database administrators. -> Someone needs to keep track of all the "alarms" going off. -> 4) The activation of triggers in a database cannot be easily -> rolled back or undone. - -Yes, points 2 and 3 are a given, although point 2 buys you the functionality -of transparent locking across all involved db servers. -Points 1 and 4 are only the case for a trigger mechanism that does -not use remote connection and 2-phase commit. - -Imho an implementation that opens a separate client connection to the -replication target is only suited for async replication, and for that a WAL -based solution would probably impose less overhead. - -Andreas - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M9946@postgresql.org Tue Jun 12 10:47:09 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CEl9L08144 - for ; Tue, 12 Jun 2001 10:47:09 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CEihE88714; - Tue, 12 Jun 2001 10:44:43 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9946@postgresql.org) -Received: from mail.greatbridge.com (mail.greatbridge.com [65.196.68.36]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CEd6E85859 - for ; Tue, 12 Jun 2001 10:39:06 -0400 (EDT) - (envelope-from djohnson@greatbridge.com) -Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [65.196.69.70]) - by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5CEcgQ04905; - Tue, 12 Jun 2001 10:38:42 -0400 -From: Darren Johnson -Date: Tue, 12 Jun 2001 14:37:18 GMT 
-Message-ID: <20010612.14371800@j2.us.greatbridge.com> -Subject: Re: AW: AW: [HACKERS] Postgres Replication -To: Zeugswetter Andreas SB -cc: pgsql-hackers@postgresql.org -Reply-To: Darren Johnson - <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> -References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> -X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux) -X-Priority: 3 (Normal) -MIME-Version: 1.0 -Content-Type: text/plain; charset=ISO-8859-1 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CEd6E85860 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - - -> Imho an implementation that opens a separate client connection to the -> replication target is only suited for async replication, and for that a -WAL -> based solution would probably impose less overhead. - - -Yes there is significant overhead with opening a connection to a -client, so Postgres-R creates a pool of backends at start up, -coupled with the group communication system (Ensemble) that -significantly reduces this issue. - - -Very good points, - -Darren - - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? 
- -http://www.postgresql.org/search.mpl - -From pgsql-hackers-owner+M9982@postgresql.org Tue Jun 12 19:04:06 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CN46E10043 - for ; Tue, 12 Jun 2001 19:04:06 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CN4AE62160; - Tue, 12 Jun 2001 19:04:10 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9982@postgresql.org) -Received: from spoetnik.xs4all.nl (spoetnik.xs4all.nl [194.109.249.226]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CMxaE60194 - for ; Tue, 12 Jun 2001 18:59:36 -0400 (EDT) - (envelope-from reinoud@xs4all.nl) -Received: from KAYAK (kayak [192.168.1.20]) - by spoetnik.xs4all.nl (Postfix) with SMTP id 435353E1B - for ; Wed, 13 Jun 2001 00:59:28 +0200 (CEST) -From: reinoud@xs4all.nl (Reinoud van Leeuwen) -To: pgsql-hackers@postgresql.org -Subject: Re: AW: AW: [HACKERS] Postgres Replication -Date: Tue, 12 Jun 2001 22:59:23 GMT -Organization: Not organized in any way -Reply-To: reinoud@xs4all.nl -Message-ID: <3b499c5b.652202125@192.168.1.10> -References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> -In-Reply-To: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> -X-Mailer: Forte Agent 1.5/32.451 -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5CMxcE60196 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Tue, 12 Jun 2001 15:50:09 +0200, you wrote: - -> ->> Here are some disadvantages to using a "trigger based" approach: ->> ->> 1) Triggers simply transfer individual data items when they ->> are modified, they do not keep track of transactions. 
->> 2) The execution of triggers within a database imposes a performance ->> overhead to that database. ->> 3) Triggers require careful management by database administrators. ->> Someone needs to keep track of all the "alarms" going off. ->> 4) The activation of triggers in a database cannot be easily ->> rolled back or undone. -> ->Yes, points 2 and 3 are a given, although point 2 buys you the functionality ->of transparent locking across all involved db servers. ->Points 1 and 4 are only the case for a trigger mechanism that does ->not use remote connection and 2-phase commit. -> ->Imho an implementation that opens a separate client connection to the ->replication target is only suited for async replication, and for that a WAL ->based solution would probably impose less overhead. - -Well, as I read back the thread I see two different approaches to -replication: - -1: tightly integrated replication. -pro: -- bi-directional (or multidirectional): updates are possible -everywhere -- A cluster of servers always has the same state. -- it does not matter to which server you connect -con: -- network between servers will be a bottleneck, especially if it is a -WAN connection -- only full replication possible -- what happens if one server is down (or the network between them)? Are -commits still possible? - -2: async replication -pro: -- long distance possible -- no problems with network outages -- only changes are replicated, selects do not have impact -- no locking issues across servers -- partial replication possible (many->one (datawarehouse), or one->many -(queries possible everywhere, updates only central)) -- good for failover situations (backup server is standing by) -con: -- bidirectional replication hard to set up (you'll have to implement -conflict resolution according to your business rules) -- different servers are not guaranteed to be in the same state. - -I can think of some scenarios where I would definitely want to -*choose* one of the options.
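Most of the async advantages listed above (outage tolerance, shipping only changes) reduce to keeping a persistent change queue between master and slave. A crude sketch, with `Slave` as an invented stand-in rather than any real replication API:

```python
# Crude sketch of async replication via a change queue: the master appends
# committed changes to an outbox; a shipper drains it whenever the slave is
# reachable, so outages only delay replication. Slave is an invented stand-in.
from collections import deque

class Slave:
    def __init__(self):
        self.reachable = True
        self.applied = []

    def apply(self, change):
        if not self.reachable:
            raise ConnectionError("slave unreachable")
        self.applied.append(change)

class AsyncShipper:
    def __init__(self, slave):
        self.outbox = deque()   # persistent change queue (in-memory here)
        self.slave = slave

    def record(self, change):
        self.outbox.append(change)   # master commits locally, never blocks

    def ship(self):
        while self.outbox:
            try:
                self.slave.apply(self.outbox[0])
            except ConnectionError:
                return False         # keep the change queued, retry later
            self.outbox.popleft()    # drop only after successful apply
        return True

slave = Slave()
shipper = AsyncShipper(slave)
shipper.record("UPDATE a")
slave.reachable = False
shipper.ship()                       # outage: change stays queued
slave.reachable = True
shipper.ship()
print(slave.applied)                 # ['UPDATE a']
```

Commits on the master never wait for the slave, which buys the "no problems with network outages" pro at the cost of the "servers not guaranteed to be in the same state" con.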
A load-balanced web environment would -likely want the first option, but synchronizing offices in different -continents might not work with 2-phase commit over the network.... - -And we have not even started talking about *managing* replicated -environments. A lot of fail-over scenarios stop planning after the -backup host has taken control. But how to get back? --- -__________________________________________________ -"Nothing is as subjective as reality" -Reinoud van Leeuwen reinoud@xs4all.nl -http://www.xs4all.nl/~reinoud -__________________________________________________ - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M9986@postgresql.org Tue Jun 12 19:48:48 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5CNmmE13125 - for ; Tue, 12 Jun 2001 19:48:48 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5CNmqE76673; - Tue, 12 Jun 2001 19:48:52 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9986@postgresql.org) -Received: from sss.pgh.pa.us ([192.204.191.242]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5CNdQE73923 - for ; Tue, 12 Jun 2001 19:39:26 -0400 (EDT) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.3/8.11.3) with ESMTP id f5CNdI016442; - Tue, 12 Jun 2001 19:39:18 -0400 (EDT) -To: reinoud@xs4all.nl -cc: pgsql-hackers@postgresql.org -Subject: Re: AW: AW: [HACKERS] Postgres Replication -In-Reply-To: <3b499c5b.652202125@192.168.1.10> -References: <11C1E6749A55D411A9670001FA68796336831F@sdexcsrv1.f000.d0188.sd.spardat.at> <3b499c5b.652202125@192.168.1.10> -Comments: In-reply-to reinoud@xs4all.nl (Reinoud van Leeuwen) - message dated "Tue, 12 Jun 2001 22:59:23 +0000"
-Date: Tue, 12 Jun 2001 19:39:18 -0400 -Message-ID: <16439.992389158@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -reinoud@xs4all.nl (Reinoud van Leeuwen) writes: -> Well as I read back the thread I see 2 different approaches to -> replication: -> ... -> I can think of some scenarios where I would definitely want to -> *choose* one of the options. - -Yes. IIRC, it looks to be possible to support a form of async -replication using the Postgres-R approach: you allow the cluster -to break apart when communications fail, and then rejoin when -your link comes back to life. (This can work in principle, how -close it is to reality is another question; but the rejoin operation -is the same as crash recovery, so you have to have it anyway.) - -So this seems to me to allow getting most of the benefits of the async -approach. OTOH it is difficult to see how to go the other way: getting -the benefits of a synchronous solution atop a basically-async -implementation doesn't seem like it can work. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? 
- -http://www.postgresql.org/search.mpl - -From pgsql-hackers-owner+M9997@postgresql.org Wed Jun 13 09:05:56 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5DD5tE28260 - for ; Wed, 13 Jun 2001 09:05:55 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5DD5xE12437; - Wed, 13 Jun 2001 09:05:59 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M9997@postgresql.org) -Received: from fizbanrsm.server.lan.at (zep4.it-austria.net [213.150.1.74]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5DD19E00635 - for ; Wed, 13 Jun 2001 09:01:10 -0400 (EDT) - (envelope-from ZeugswetterA@wien.spardat.at) -Received: from gz0153.gc.spardat.at (gz0153.gc.spardat.at [172.20.10.149]) - by fizbanrsm.server.lan.at (8.11.2/8.11.2) with ESMTP id f5DD13m08153 - for ; Wed, 13 Jun 2001 15:01:03 +0200 -Received: by sdexcgtw01.f000.d0188.sd.spardat.at with Internet Mail Service (5.5.2650.21) - id ; Wed, 13 Jun 2001 15:00:02 +0200 -Message-ID: <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at> -From: Zeugswetter Andreas SB -To: "'reinoud@xs4all.nl'" , pgsql-hackers@postgresql.org -Subject: AW: AW: AW: [HACKERS] Postgres Replication -Date: Wed, 13 Jun 2001 11:55:48 +0200 -MIME-Version: 1.0 -X-Mailer: Internet Mail Service (5.5.2650.21) -Content-Type: text/plain; - charset="iso-8859-1" -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -> Well as I read back the thread I see 2 different approaches to -> replication: -> -> 1: tight integrated replication. -> pro: -> - bi-directional (or multidirectional): updates are possible everywhere -> - A cluster of servers allways has the same state. 
-> - it does not matter to which server you connect -> con: -> - network between servers will be a bottleneck, especially if it is a -> WAN connection -> - only full replication possible - -I do not understand that point, if it is trigger based, you -have all the flexibility you need. (only some tables, only some rows, -different rows to different targets ....), -(or do you mean not all targets, that could also be achieved with triggers) - -> - what happens if one server is down? (or the network between) are -> commits still possible - -No, updates are not possible if one target is not reachable, -that would not be synchronous and would again need business rules -to resolve conflicts. - -Allowing updates when a target is not reachable would require admin -intervention. - -Andreas - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M10005@postgresql.org Wed Jun 13 11:15:48 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f5DFFmE08382 - for ; Wed, 13 Jun 2001 11:15:48 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.1) with SMTP id f5DFFoE53621; - Wed, 13 Jun 2001 11:15:50 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M10005@postgresql.org) -Received: from mail.greatbridge.com (mail.greatbridge.com [65.196.68.36]) - by postgresql.org (8.11.3/8.11.1) with ESMTP id f5DEk7E38930 - for ; Wed, 13 Jun 2001 10:46:07 -0400 (EDT) - (envelope-from djohnson@greatbridge.com) -Received: from j2.us.greatbridge.com (djohnsonpc.us.greatbridge.com [65.196.69.70]) - by mail.greatbridge.com (8.11.2/8.11.2) with SMTP id f5DEhfQ22566; - Wed, 13 Jun 2001 10:43:41 -0400 -From: Darren Johnson -Date: Wed, 13 Jun 2001 14:44:11 GMT -Message-ID: <20010613.14441100@j2.us.greatbridge.com> -Subject: Re: AW: AW: AW: [HACKERS] Postgres 
Replication -To: Zeugswetter Andreas SB -cc: "'reinoud@xs4all.nl'" , pgsql-hackers@postgresql.org -Reply-To: Darren Johnson - <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at> -References: <11C1E6749A55D411A9670001FA687963368322@sdexcsrv1.f000.d0188.sd.spardat.at> -X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux) -X-Priority: 3 (Normal) -MIME-Version: 1.0 -Content-Type: text/plain; charset=ISO-8859-1 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id f5DEk8E38931 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -> > - only full replication possible - -> I do not understand that point, if it is trigger based, you -> have all the flexibility you need. (only some tables, only some rows, -> different rows to different targets ....), -> (or do you mean not all targets, that could also be achieved with -triggers) - -Currently with Postgres-R, it is one database replicating all tables to -all servers in the group communication system. There are some ways -around -this by invoking the -r option when a SQL statement should be replicated, -and leaving the -r option off for non-replicated scenarios. IMHO this is -not a good solution. - -A better solution will need to be implemented, which involves a -subscription table(s) with relation/server information. There are two -ideas for subscribing and receiving replicated data. - -1) Receiver driven propagation - A simple solution where all -transactions are propagated and the receiving servers will reference -the subscription information before applying updates. - -2) Sender driven propagation - A more optimal and complex solution -where servers do not receive any messages regarding data items for -which they have not subscribed - - -> > - what happens if one server is down? 
(or the network between) are -> > commits still possible - -> No, updates are not possible if one target is not reachable, - -AFAIK, Postgres-R can still replicate if one target is not reachable, -but only to the remaining servers ;). - -There is a scenario that could arise if a server issues a lock -request then fails or goes off line. There is code that checks -for this condition, which needs to be merged with the branch we have. - -> that would not be synchronous and would again need business rules -> to resolve conflicts. - -Yes the failed server would not be synchronized, and getting this -failed server back in sync needs to be addressed. - -> Allowing updates when a target is not reachable would require admin -> intervention. - -In its current state yes, but our goal would be to eliminate this -requirement as well. - - - -Darren - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M18443=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 19:16:17 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g150GGP03822 - for ; Mon, 4 Feb 2002 19:16:16 -0500 (EST) -Received: (qmail 77444 invoked by alias); 5 Feb 2002 00:16:11 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 5 Feb 2002 00:16:11 -0000 -Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [24.147.138.78]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g150Esl77040 - for ; Mon, 4 Feb 2002 19:14:54 -0500 (EST) - (envelope-from markw@mohawksoft.com) -Received: from mohawksoft.com (localhost [127.0.0.1]) - by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g150AWh08676 - for ; Mon, 4 Feb 2002 19:10:33 -0500 
-Message-ID: <3C5F22F8.C9B958F0@mohawksoft.com> -Date: Mon, 04 Feb 2002 19:10:32 -0500 -From: mlw -X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: PostgreSQL-development -Subject: [HACKERS] Replication -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it -works like the whole rserv project. I don't like it. - -OK, what the hell do we need to do to get PostgreSQL replicating? - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M18445=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 19:57:01 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g150v0P06518 - for ; Mon, 4 Feb 2002 19:57:00 -0500 (EST) -Received: (qmail 90440 invoked by alias); 5 Feb 2002 00:56:59 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 5 Feb 2002 00:56:59 -0000 -Received: from www1.navtechinc.com ([192.234.226.140]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g150rMl89885 - for ; Mon, 4 Feb 2002 19:53:22 -0500 (EST) - (envelope-from ssinger@navtechinc.com) -Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [192.234.226.190]) - by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id AAA06047; - Tue, 5 Feb 2002 00:53:22 GMT -Received: from localhost (ssinger@localhost) - by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id AAA10675; - Tue, 5 Feb 2002 00:52:43 GMT -Date: Tue, 5 Feb 2002 00:52:43 +0000 (GMT) -From: Steven -X-X-Sender: -To: mlw -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII 
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-On Mon, 4 Feb 2002, mlw wrote:
-
-I've developed a replacement for Rserv and we are planning on releasing
-it as open source (i.e. as a contrib module).
-
-Like Rserv it's trigger-based, but it's much more flexible.
-The key advantages it has over Rserv are that it has
--Support for multiple slaves
--It preserves transactions while doing the mirroring. I.e. if rows A,B are
-originally added in the same transaction they will be mirrored in the same
-transaction.
-
-We have plans on adding filtering based on data/selective mirroring as
-well. (I.e. only rows with COUNTRY='Canada' go to
-slave A, and rows with COUNTRY='China' go to slave B.)
-But I'm not sure when I'll get to that.
-
-Support for conflict resolution (if edits are allowed on the slaves)
-would be nice.
-
-I hope to be able to send a tarball with the source to the pgpatches list
-within the next few days.
-
-We've been using the system operationally for a number of months and have
-been happy with it.
-
-> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it
-> works like the whole rserv project. I don't like it.
-> OK, what the hell do we need to do to get PostgreSQL replicating?
->
-> ---------------------------(end of broadcast)---------------------------
-> TIP 4: Don't 'kill -9' the postmaster
->
-
--
-Steven Singer ssinger@navtechinc.com
-Aircraft Performance Systems Phone: 519-747-1170 ext 282
-Navtech Systems Support Inc.
AFTN: CYYZXNSX SITA: YYZNSCR
-Waterloo, Ontario ARINC: YKFNSCR
-
-
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-
-From pgsql-hackers-owner+M18447=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 20:06:57 2002
-Return-path: 
-Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g1516vP07508
- for ; Mon, 4 Feb 2002 20:06:57 -0500 (EST)
-Received: (qmail 92753 invoked by alias); 5 Feb 2002 01:06:55 -0000
-Received: from unknown (HELO postgresql.org) (64.49.215.8)
- by www.postgresql.org with SMTP; 5 Feb 2002 01:06:55 -0000
-Received: from inflicted.crimelabs.net (crimelabs.net [66.92.101.112])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g150vhl91978
- for ; Mon, 4 Feb 2002 19:57:44 -0500 (EST)
- (envelope-from bpalmer@crimelabs.net)
-Received: from mizer.crimelabs.net (mizer.crimelabs.net [192.168.88.10])
- by inflicted.crimelabs.net (Postfix) with ESMTP
- id 9D6EE8779; Mon, 4 Feb 2002 19:57:46 -0500 (EST)
-Date: Mon, 4 Feb 2002 19:57:34 -0500 (EST)
-From: bpalmer 
-To: mlw 
-cc: PostgreSQL-development 
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com>
-Message-ID: 
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
->
-> OK, what the hell do we need to do to get PostgreSQL replicating?
-
-I hope you understand that replication, done right, is a massive
-project. I know that Darren and myself (and the rest of the pg-repl
-folks) have been waiting until 7.2 went gold before we did any more work. I
-think we hope to have master/slave replication working for 7.3 and then
-target multimaster for 7.4. At least that's the hope.
- -- Brandon - ----------------------------------------------------------------------------- - c: 646-456-5455 h: 201-798-4983 - b. palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5 - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M18449=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 21:16:56 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g152GtP10503 - for ; Mon, 4 Feb 2002 21:16:55 -0500 (EST) -Received: (qmail 6711 invoked by alias); 5 Feb 2002 02:16:53 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 5 Feb 2002 02:16:53 -0000 -Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [24.147.138.78]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g151qSl99469 - for ; Mon, 4 Feb 2002 20:52:28 -0500 (EST) - (envelope-from markw@mohawksoft.com) -Received: from mohawksoft.com (localhost [127.0.0.1]) - by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g151lph09147; - Mon, 4 Feb 2002 20:47:51 -0500 -Message-ID: <3C5F39C7.970F4549@mohawksoft.com> -Date: Mon, 04 Feb 2002 20:47:51 -0500 -From: mlw -X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: Steven -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -References: -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Steven wrote: -> -> On Mon, 4 Feb 2002, mlw wrote: -> -> I've developed a replacement for Rserv and we are planning on releasing -> it as open source(ie as a contrib module). -> -> Like Rserv its trigger based but its much more flexible. 
-> The key adventages it has over Rserv is that it has -> -Support for multiple slaves -> -It Perserves transactions while doing the mirroring. Ie If rows A,B are -> originally added in the same transaction they will be mirrored in the same -> transaction. - -I did a similar thing. I took the rserv trigger "as is," but rewrote the -replication support code. What I eventually did was write a "snapshot daemon" -which created snapshot files. Then a "slave daemon" which would check the last -snapshot applied and apply all the snapshots, in order, as needed. One would -run one of these daemons per slave server. - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M18448=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 20:57:25 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g151vOP09239 - for ; Mon, 4 Feb 2002 20:57:24 -0500 (EST) -Received: (qmail 99828 invoked by alias); 5 Feb 2002 01:57:19 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 5 Feb 2002 01:57:19 -0000 -Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [24.147.138.78]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g151s0l99529 - for ; Mon, 4 Feb 2002 20:54:00 -0500 (EST) - (envelope-from markw@mohawksoft.com) -Received: from mohawksoft.com (localhost [127.0.0.1]) - by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g151nah09156; - Mon, 4 Feb 2002 20:49:37 -0500 -Message-ID: <3C5F3A30.A4C46FB8@mohawksoft.com> -Date: Mon, 04 Feb 2002 20:49:36 -0500 -From: mlw -X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: bpalmer -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -References: -Content-Type: text/plain; charset=us-ascii 
-Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -bpalmer wrote: -> -> > -> > OK, what the hell do we need to do to get PostgreSQL replicating? -> -> I hope you understand that replication, done right, is a massive -> project. I know that Darren any myself (and the rest of the pg-repl -> folks) have been waiting till 7.2 went gold till we did anymore work. I -> think we hope to have master / slave replicatin working for 7.3 and then -> target multimaster for 7.4. At least that's the hope. - -I do know how hard replication is. I also understand how important it is. - -If you guys have a project going, and need developers, I am more than willing. - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M18450=candle.pha.pa.us=pgman@postgresql.org Mon Feb 4 21:42:13 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g152gCP11957 - for ; Mon, 4 Feb 2002 21:42:13 -0500 (EST) -Received: (qmail 14229 invoked by alias); 5 Feb 2002 02:42:09 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 5 Feb 2002 02:42:09 -0000 -Received: from www1.navtechinc.com ([192.234.226.140]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g152SBl10682 - for ; Mon, 4 Feb 2002 21:28:11 -0500 (EST) - (envelope-from ssinger@navtechinc.com) -Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [192.234.226.190]) - by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id CAA06384; - Tue, 5 Feb 2002 02:28:13 GMT -Received: from localhost (ssinger@localhost) - by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id CAA10682; - Tue, 5 Feb 2002 02:27:35 GMT -Date: Tue, 5 Feb 2002 02:27:35 +0000 (GMT) -From: Steven -X-X-Sender: -To: mlw -cc: PostgreSQL-development 
-Subject: Re: [HACKERS] Replication
-In-Reply-To: <3C5F39C7.970F4549@mohawksoft.com>
-Message-ID: 
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-
-DBMirror doesn't use snapshots; instead it records a log of transactions
-that are committed to the database in a pair of tables.
-In the case of an INSERT, this is the row that is being added.
-In the case of a DELETE, the primary key of the row being deleted.
-
-And in the case of an UPDATE, the primary key before the update along with
-all of the data the row should have after an update.
-
-Then for each slave database a Perl script walks through the transactions
-that are pending for that host and reconstructs SQL to send the row edits
-to that host. A record of the fact that transaction Y has been sent to
-host X is also kept.
-
-When transaction X has been sent to all of the hosts that are in the
-system it is then deleted from the Pending tables.
-
-I suspect that all of the information I'm storing in the Pending tables is
-also being stored by Postgres in its log but I haven't investigated how
-the information could be extracted (or how long it is kept for). That
-would reduce the extra storage overhead that the replication system
-imposes.
-
-As I remember (it's been a while since I've looked at it) RServ uses OIDs
-in its tables to point to the data that needs to be replicated. We tried
-a similar approach but found difficulties with doing partial updates.
-
-
-
-
-
-
-On Mon, 4 Feb 2002, mlw wrote:
-
-> I did a similar thing. I took the rserv trigger "as is," but rewrote the
-> replication support code. What I eventually did was write a "snapshot daemon"
-> which created snapshot files. Then a "slave daemon" which would check the last
-> snapshot applied and apply all the snapshots, in order, as needed. One would
-> run one of these daemons per slave server.
- - - - - - --- -Steven Singer ssinger@navtechinc.com -Aircraft Performance Systems Phone: 519-747-1170 ext 282 -Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR -Waterloo, Ontario ARINC: YKFNSCR - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M18554=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 02:49:48 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g177nlP04347 - for ; Thu, 7 Feb 2002 02:49:47 -0500 (EST) -Received: (qmail 22556 invoked by alias); 7 Feb 2002 07:49:49 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 7 Feb 2002 07:49:49 -0000 -Received: from linuxworld.com.au (www.linuxworld.com.au [203.34.46.50]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g177QfE19572 - for ; Thu, 7 Feb 2002 02:26:42 -0500 (EST) - (envelope-from swm@linuxworld.com.au) -Received: from localhost (swm@localhost) - by linuxworld.com.au (8.11.4/8.11.4) with ESMTP id g177RiU06086; - Thu, 7 Feb 2002 18:27:45 +1100 -Date: Thu, 7 Feb 2002 18:27:44 +1100 (EST) -From: Gavin Sherry -To: mlw -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: <3C5F22F8.C9B958F0@mohawksoft.com> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Mon, 4 Feb 2002, mlw wrote: - -> I re-wrote RServ.pm to C, and wrote a replication daemon. It works, but it -> works like the whole rserv project. I don't like it. -> -> OK, what the hell do we need to do to get PostgreSQL replicating? - -The trigger model is not a very sophisticated one. I think I have a better --- though more complicated -- one. 
This model would be able to handle
-multiple masters and master->slave.
-
-First of all, all machines in the cluster would have to be aware of all the
-machines in the cluster. This would have to be stored in a new system
-table.
-
-The FE/BE protocol would need to be modified to accept parsed node trees
-generated by pg_analyze_and_rewrite(). These could then be dispatched by
-the executing server, inside of pg_exec_query_string, to all other servers
-in the cluster (excluding itself). Naturally, this dispatch would need to
-be non-blocking.
-
-pg_exec_query_string() would need to check the node tags to make sure
-SELECTs and perhaps some commands are not dispatched.
-
-Before the executing server runs finish_xact_command(), it would check
-that the query was successfully executed on all machines, otherwise
-abort. Such a system would need a few configuration options: whether or
-not you abort on failed replication to slaves, the ability to replicate
-only certain tables, etc.
-
-Naturally, this would slow down writes to the system (possibly a lot
-depending on the performance difference between the executing machine and
-the least powerful machine in the cluster), but most usages of postgresql
-are read intensive, not write.
-
-Any reason this model would not work?
- -Gavin - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M18558=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 08:31:00 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17DUxP13923 - for ; Thu, 7 Feb 2002 08:30:59 -0500 (EST) -Received: (qmail 91796 invoked by alias); 7 Feb 2002 13:30:55 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 7 Feb 2002 13:30:55 -0000 -Received: from snoopy.mohawksoft.com (h0050bf7a618d.ne.mediaone.net [24.147.138.78]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g17Cw0E87782 - for ; Thu, 7 Feb 2002 07:58:01 -0500 (EST) - (envelope-from markw@mohawksoft.com) -Received: from mohawksoft.com (localhost [127.0.0.1]) - by snoopy.mohawksoft.com (8.11.6/8.11.6) with ESMTP id g17CqNt16887; - Thu, 7 Feb 2002 07:52:24 -0500 -Message-ID: <3C627887.CC9FF837@mohawksoft.com> -Date: Thu, 07 Feb 2002 07:52:23 -0500 -From: mlw -X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.4.17 i686) -X-Accept-Language: en -MIME-Version: 1.0 -To: Gavin Sherry -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -References: -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Gavin Sherry wrote: -> Naturally, this would slow down writes to the system (possibly a lot -> depending on the performance difference between the executing machine and -> the least powerful machine in the cluster), but most usages of postgresql -> are read intensive, not write. -> -> Any reason this model would not work? - -What, then is the purpose of replication to multiple masters? - -I can think of only two reasons why you want replication. 
(1) Redundancy: make
-sure that if one server dies, then another server has the same data and is used
-seamlessly. (2) Increase performance over one system.
-
-In reason (1) I submit that a load balancer which sits on top of
-PostgreSQL and executes writes on both servers while distributing reads would
-be best. This is a HUGE project. The load balancer must know EXACTLY how the
-system is configured, which includes all functions and everything.
-
-In reason (2) your system would fail to provide the scalability that would be
-needed. If writes take a long time, but reads are fine, what is the difference
-between this and the trigger-based replicator?
-
-I have in the back of my mind an idea of patching into the WAL stuff, and
-using that mechanism to push changes out to the slaves.
-
-Where one machine is still the master, but no trigger stuff, just a WAL patch.
-Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-exactly, the idea hasn't completely formed yet.
-
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ?
-
-http://www.postgresql.org/users-lounge/docs/faq.html
-
-From pgsql-hackers-owner+M18574=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 12:51:42 2002
-Return-path: 
-Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17HpfP16661
- for ; Thu, 7 Feb 2002 12:51:41 -0500 (EST)
-Received: (qmail 62955 invoked by alias); 7 Feb 2002 17:50:42 -0000
-Received: from unknown (HELO postgresql.org) (64.49.215.8)
- by www.postgresql.org with SMTP; 7 Feb 2002 17:50:42 -0000
-Received: from www1.navtechinc.com ([192.234.226.140])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g17HnTE62256
- for ; Thu, 7 Feb 2002 12:49:29 -0500 (EST)
- (envelope-from ssinger@navtechinc.com)
-Received: from pcNavYkfAdm1.ykf.navtechinc.com (wall [192.234.226.190])
- by www1.navtechinc.com (8.9.3/8.9.3) with ESMTP id RAA07908;
- Thu, 7 Feb 2002 17:49:31 GMT
-Received: from localhost (ssinger@localhost)
- by pcNavYkfAdm1.ykf.navtechinc.com (8.9.3/8.9.3) with ESMTP id RAA05687;
- Thu, 7 Feb 2002 17:48:52 GMT
-Date: Thu, 7 Feb 2002 17:48:51 +0000 (GMT)
-From: Steven Singer 
-X-X-Sender: 
-To: Gavin Sherry 
-cc: mlw ,
- PostgreSQL-development 
-Subject: Re: [HACKERS] Replication
-In-Reply-To: 
-Message-ID: 
-MIME-Version: 1.0
-Content-Type: TEXT/PLAIN; charset=US-ASCII
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-
-
-What you describe sounds like a form of a two-stage commit protocol.
-
-If the command worked on two of the replicated databases but failed on a
-third then the executing server would have to be able to undo the command
-on the replicated databases as well as itself.
-
-The problems with two-stage-commit-type approaches to replication are
-1) Speed, as you mentioned. Write speed isn't a concern for some
-applications but it is very important in others.
-
-and
-2) All of the databases must be able to communicate with each other at
-all times in order for any edits to work. If the servers are
-connected over some sort of WAN that periodically has short outages this
-is a problem. Also if you're using replication because you want to be able
-to take down one of the databases for short periods of time without
-bringing down the others you're in trouble.
-
-
-btw: I posted the alternative to Rserv that I mentioned the other day to
-the pg-patches mailing list. If anyone is interested you should be able
-to grab it off the archives.
-
-On Thu, 7 Feb 2002, Gavin Sherry wrote:
-
->
-> First of all, all machines in the cluster would have to be aware all the
-> machines in the cluster. This would have to be stored in a new system
-> table.
->
-> The FE/BE protocol would need to be modified to accepted parsed node trees
-> generated by pg_analyze_and_rewrite(). These could then be dispatched by
-> the executing server, inside of pg_exec_query_string, to all other servers
-> in the cluster (excluding itself). Naturally, this dispatch would need to
-> be non-blocking.
->
-> pg_exec_query_string() would need to check that nodetags to make sure
-> selects and perhaps some commands are not dispatched.
->
-> Before the executing server runs finish_xact_command(), it would check
-> that the query was successfully executed on all machines otherwise
-> abort. Such a system would need a few configuration options: whether or
-> not you abort on failed replication to slaves, the ability to replicate
-> only certain tables, etc.
->
-> Naturally, this would slow down writes to the system (possibly a lot
-> depending on the performance difference between the executing machine and
-> the least powerful machine in the cluster), but most usages of postgresql
-> are read intensive, not write.
->
-> Any reason this model would not work?
-> -> Gavin -> -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 4: Don't 'kill -9' the postmaster -> - --- -Steven Singer ssinger@navtechinc.com -Aircraft Performance Systems Phone: 519-747-1170 ext 282 -Navtech Systems Support Inc. AFTN: CYYZXNSX SITA: YYZNSCR -Waterloo, Ontario ARINC: YKFNSCR - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M18590=candle.pha.pa.us=pgman@postgresql.org Thu Feb 7 17:50:42 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g17MoeP27121 - for ; Thu, 7 Feb 2002 17:50:40 -0500 (EST) -Received: (qmail 39930 invoked by alias); 7 Feb 2002 22:50:17 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 7 Feb 2002 22:50:17 -0000 -Received: from odin.fts.net (wall.icgate.net [209.26.177.2]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g17Ma4E38041 - for ; Thu, 7 Feb 2002 17:36:04 -0500 (EST) - (envelope-from fharvell@odin.fts.net) -Received: from odin.fts.net (fharvell@localhost) - by odin.fts.net (8.11.6/8.11.6) with ESMTP id g17MZhR17707; - Thu, 7 Feb 2002 17:35:43 -0500 -Message-ID: <200202072235.g17MZhR17707@odin.fts.net> -X-Mailer: exmh version 2.2 06/23/2000 with nmh-1.0.4 -From: F Harvell -To: mlw -cc: Gavin Sherry , - PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: Message from mlw - of "Thu, 07 Feb 2002 07:52:23 EST." - <3C627887.CC9FF837@mohawksoft.com> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Date: Thu, 07 Feb 2002 17:35:43 -0500 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -I'm not that familiar with the whole replication issues in PostgreSQL, -however, I would be partial to replication that was based upon the -playback of the (a?) journal file. 
(I believe that the WAL is a
-journal file.)
-
-By being based upon a journal file, it would be possible to accomplish
-two significant items. First, it would be possible to "restore" a
-database to an exact state just before a failure. Most commercial
-databases provide the ability to do this. Banks, etc. log the journal
-files directly to tape to provide a complete transaction history such
-that they can rebuild their database from any given snapshot. (Note
-that the journal file needs to be "editable" as a failure may be
-"delete from x" with a missing where clause.)
-
-This leads directly into the second advantage, the ability to have a
-replicated database operating anywhere, over any connection on any
-server. Speed of writes would not be a factor. In essence, as long
-as the replicated database had a snapshot of the database and then was
-provided with all journal files since the snapshot, it would be
-possible to build a current database. If the replicant got behind in
-the processing, it would catch up when things slowed down.
-
-In my opinion, the first advantage is in many ways most important.
-Replication becomes simply the restoration of the database in realtime
-on a second server. The "replication" task becomes the definition of
-a protocol for distributing the journal file. At least one major
-database vendor does replication (shadowing) in exactly this manner.
-
-Maybe I'm all wet and the journal file and journal playback already
-exist. If so, IMHO, basing replication off of this would be the
-right direction.
-
-
-On Thu, 07 Feb 2002 07:52:23 EST, mlw wrote:
->
-> I have in the back of my mind, an idea of patching into the WAL stuff, and
-> using that mechanism to push changes out to the slaves.
->
-> Where one machine is still the master, but no trigger stuff, just a WAL patch.
-> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-> exactly, the idea hasn't completely formed yet.
-> - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M18605=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 00:50:08 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g185o7P27878 - for ; Fri, 8 Feb 2002 00:50:07 -0500 (EST) -Received: (qmail 17348 invoked by alias); 8 Feb 2002 05:50:03 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 8 Feb 2002 05:50:03 -0000 -Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [68.1.17.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g185cTE15241 - for ; Fri, 8 Feb 2002 00:38:29 -0500 (EST) - (envelope-from darren.johnson@cox.net) -Received: from cox.net ([68.10.181.230]) by lakemtao03.mgt.cox.net - (InterMail vM.5.01.04.05 201-253-122-122-105-20011231) with ESMTP - id <20020208053833.YKTV6710.lakemtao03.mgt.cox.net@cox.net> - for ; - Fri, 8 Feb 2002 00:38:33 -0500 -Message-ID: <3C636232.6060206@cox.net> -Date: Fri, 08 Feb 2002 00:29:22 -0500 -From: Darren Johnson -User-Agent: Mozilla/5.0 (Windows; U; WinNT4.0; en-US; m18) Gecko/20001108 Netscape6/6.0 -X-Accept-Language: en -MIME-Version: 1.0 -To: PostgreSQL-development -Subject: Re: [HACKERS] Replication -References: -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - - > - > The problems with two stage commit type approches to replication are - -IMHO the biggest problem with two phased commit is it doesn't scale. -The more servers -you add to the replica the slower it goes. Also there's the potential -for dead locks across -server boundaries. - - > - > 2) All of the databases must be able to communicate with each other at - > all times in order for any edits to work. 
If the servers are
- > connected over some sort of WAN that periodically has short outages this
- > is a problem. Also if your using replication because you want to be
-able
- > to take down one of the databases for short periods of time without
- > bringing down the others your in trouble.
-
-All true for the two-phase commit protocol. To have multi-master
-replication, you must have all
-systems communicating, but you can use a multicast group communication
-system instead of
-2PC. Using total order messaging, you can ensure all changes are
-delivered to all servers in the
-replica in the same order. This group communication system also allows
-failures to be detected
-while other servers in the replica continue processing.
-
-A few of us are working with this theory, and trying to integrate with
-7.2. There is a working
-model for 6.4, but it's very limited (insert, update, and delete). We
-are currently hosted at
-
-http://gborg.postgresql.org/project/pgreplication/projdisplay.php
-But the site has been down the last 2 days. I've contacted the
-webmaster, but haven't seen
-any results yet. If anyone knows what's going on with gborg, I'd
-appreciate a status.
-
-Darren
-
-
----------------------------(end of broadcast)---------------------------
-TIP 2: you can get off all lists at once with the unregister command
- (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
-
-From pgsql-hackers-owner+M18617=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 06:20:44 2002
-Return-path: 
-Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
- by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18BKhP06132
- for ; Fri, 8 Feb 2002 06:20:43 -0500 (EST)
-Received: (qmail 90815 invoked by alias); 8 Feb 2002 11:20:40 -0000
-Received: from unknown (HELO postgresql.org) (64.49.215.8)
- by www.postgresql.org with SMTP; 8 Feb 2002 11:20:40 -0000
-Received: from laptop.kieser.demon.co.uk (kieser.demon.co.uk [62.49.6.72])
- by postgresql.org (8.11.3/8.11.4) with ESMTP id g18B9ZE89589
- for ; Fri, 8 Feb 2002 06:09:36 -0500 (EST)
- (envelope-from brad@kieser.net)
-Received: from laptop.kieser.demon.co.uk (localhost.localdomain [127.0.0.1])
- by laptop.kieser.demon.co.uk (Postfix) with SMTP
- id 598393A132; Fri, 8 Feb 2002 11:09:36 +0000 (GMT)
-From: Bradley Kieser 
-Date: Fri, 08 Feb 2002 11:09:36 GMT
-Message-ID: <20020208.11093600@laptop.kieser.demon.co.uk>
-Subject: Re: [HACKERS] Replication
-To: Darren Johnson 
-cc: PostgreSQL-development 
-In-Reply-To: <3C636232.6060206@cox.net>
-References: <3C636232.6060206@cox.net>
-X-Mailer: Mozilla/3.0 (compatible; StarOffice/5.2;Linux)
-X-Priority: 3 (Normal)
-MIME-Version: 1.0
-Content-Type: text/plain; charset=ISO-8859-1
-Content-Transfer-Encoding: 8bit
-X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id g18BJoF90352
-Precedence: bulk
-Sender: pgsql-hackers-owner@postgresql.org
-Status: OR
-
-Darren,
-Given that different replication strategies will probably be developed
-for PG, do you envisage DBAs being able to select the type of replication
-for their installation? I.e. replication being selectable rather like
-storage structures?
- -Would be a killer bit of flexibility, given how enormous the impact of -replication will be to corporate adoption of PG. - -Brad - - ->>>>>>>>>>>>>>>>>> Original Message <<<<<<<<<<<<<<<<<< - -On 2/8/02, 5:29:22 AM, Darren Johnson wrote -regarding Re: [HACKERS] Replication: - - -> > -> > The problems with two stage commit type approches to replication are - -> IMHO the biggest problem with two phased commit is it doesn't scale. -> The more servers -> you add to the replica the slower it goes. Also there's the potential -> for dead locks across -> server boundaries. - -> > -> > 2) All of the databases must be able to communicate with each other at -> > all times in order for any edits to work. If the servers are -> > connected over some sort of WAN that periodically has short outages this -> > is a problem. Also if your using replication because you want to be -> able -> > to take down one of the databases for short periods of time without -> > bringing down the others your in trouble. - -> All true for two phased commit protocol. To have multi master -> replication, you must have all -> systems communicating, but you can use a multicast group communication -> system instead of -> 2PC. Using total order messaging, you can ensure all changes are -> delivered to all servers in the -> replica in the same order. This group communication system also allows -> failures to be detected -> while other servers in the replica continue processing. - -> A few of us are working with this theory, and trying to integrate with -> 7.2. There is a working -> model for 6.4, but its very limited. (insert, update, and deletes) We -> are currently hosted at - -> http://gborg.postgresql.org/project/pgreplication/projdisplay.php -> But the site has been down the last 2 days. I've contacted the web -> master, but haven't seen -> any results yet. If any one knows what going on with gborg, I'd -> appreciate a status. 
- -> Darren - - -> ---------------------------(end of broadcast)--------------------------- -> TIP 2: you can get off all lists at once with the unregister command -> (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M18642=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 12:40:36 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18HeZP08450 - for ; Fri, 8 Feb 2002 12:40:35 -0500 (EST) -Received: (qmail 74089 invoked by alias); 8 Feb 2002 17:40:30 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 8 Feb 2002 17:40:30 -0000 -Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [68.1.17.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g18HbwE73437 - for ; Fri, 8 Feb 2002 12:37:58 -0500 (EST) - (envelope-from darren.johnson@cox.net) -Received: from cox.net ([68.10.181.230]) by lakemtao03.mgt.cox.net - (InterMail vM.5.01.04.05 201-253-122-122-105-20011231) with ESMTP - id <20020208173804.DKQS6710.lakemtao03.mgt.cox.net@cox.net>; - Fri, 8 Feb 2002 12:38:04 -0500 -Message-ID: <3C63FB71.206@cox.net> -Date: Fri, 08 Feb 2002 11:23:13 -0500 -From: Darren Johnson -User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; m18) Gecko/20010131 Netscape6/6.01 -X-Accept-Language: en -MIME-Version: 1.0 -To: Bradley Kieser -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Replication -References: <3C636232.6060206@cox.net> <20020208.11093600@laptop.kieser.demon.co.uk> -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -> -> Given that different replication strategies will probably be developed -> for PG, do you 
envisage DBAs to be able to select the type of replication -> for their installation? I.e. Replication being selectable rther like -> storage structures? - -I can't speak for other replication solutions, but we are using the ---with-replication or --r parameter when starting postmaster. Some day I hope there will be -parameters for -master/slave partial/full and sync/async, but it will be some time -before we cross those -bridges. - -Darren - - - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - -From pgsql-hackers-owner+M18658=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 14:42:40 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18JgdP28166 - for ; Fri, 8 Feb 2002 14:42:39 -0500 (EST) -Received: (qmail 18650 invoked by alias); 8 Feb 2002 19:42:39 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 8 Feb 2002 19:42:39 -0000 -Received: from enigma.trueimpact.net (enigma.trueimpact.net [209.82.45.201]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g18JYBE17341 - for ; Fri, 8 Feb 2002 14:34:11 -0500 (EST) - (envelope-from rjonasz@trueimpact.com) -Received: from nietzsche.trueimpact.net (unknown [209.82.45.200]) - by enigma.trueimpact.net (Postfix) with ESMTP id A785066B04 - for ; Fri, 8 Feb 2002 14:33:28 -0500 (EST) -Date: Fri, 8 Feb 2002 14:34:34 -0500 (EST) -From: Randall Jonasz -X-X-Sender: -To: PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: <3C627887.CC9FF837@mohawksoft.com> -Message-ID: <20020208142932.H6545-100000@nietzsche.trueimpact.net> -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -I've been looking into database replication theory lately and have found -some interesting papers discussing 
various approaches. (Here's
-one paper that struck me as being very helpful:
-http://citeseer.nj.nec.com/460405.html ) So far I favour an
-eager replication system predicated on a read-local/write-all-available
-approach. The system should not depend on two-phase commit or
-primary-copy algorithms. The former leaves the whole system only as
-quick as the slowest machine. In addition, two-phase commit involves 2n
-messages per transaction, which does not scale well at all. This idea
-will also have to take into account a crashed node which did not ack a
-transaction. The primary-copy algorithms I've seen suffer from a single
-point of failure and potential bottlenecks at the primary node.
-
-Instead I like the master-to-master, or peer-to-peer, algorithm
-discussed in the above paper. This approach accounts for network
-partitions, nodes leaving and joining a cluster, and the ability to
-commit a transaction once the communication module has determined the
-total order of that transaction, i.e. no need to wait for acks. This
-scales well, and research has shown it to increase the number of
-transactions/second a database cluster can handle over a single node.
-
-Postgres-R is another interesting approach which I think should be
-taken seriously. Anyone interested can read a paper on this at
-http://citeseer.nj.nec.com/330257.html
-
-Anyways, my two cents
-
-Randall Jonasz
-Software Engineer
-Click2net Inc.
-
-
-On Thu, 7 Feb 2002, mlw wrote:
-
-> Gavin Sherry wrote:
-> > Naturally, this would slow down writes to the system (possibly a lot
-> > depending on the performance difference between the executing machine and
-> > the least powerful machine in the cluster), but most usages of postgresql
-> > are read intensive, not write.
-> >
-> > Any reason this model would not work?
->
-> What, then is the purpose of replication to multiple masters?
->
-> I can think of only two reasons why you want replication. 
(1) Redundancy, make -> sure that if one server dies, then another server has the same data and is used -> seamlessly. (2) Increase performance over one system. -> -> In reason (1) I submit that a server load balance which sits on top of -> PostgreSQL, and executes writes on both servers while distributing reads would -> be best. This is a HUGE project. The load balancer must know EXACTLY how the -> system is configured, which includes all functions and everything. -> -> In reason (2) your system would fail to provide the scalability that would be -> needed. If writes take a long time, but reads are fine, what is the difference -> between the trigger based replicator? -> -> I have in the back of my mind, an idea of patching into the WAL stuff, and -> using that mechanism to push changes out to the slaves. -> -> Where one machine is still the master, but no trigger stuff, just a WAL patch. -> Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure -> exactly, the idea hasn't completely formed yet. -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 5: Have you checked our extensive FAQ? -> -> http://www.postgresql.org/users-lounge/docs/faq.html -> -> - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M18660=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 15:20:32 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18KKSP03731 - for ; Fri, 8 Feb 2002 15:20:29 -0500 (EST) -Received: (qmail 28961 invoked by alias); 8 Feb 2002 20:20:27 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 8 Feb 2002 20:20:27 -0000 -Received: from inflicted.crimelabs.net (crimelabs.net [66.92.101.112]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g18KC7E27667 - for ; Fri, 8 Feb 2002 15:12:07 -0500 (EST) - (envelope-from bpalmer@crimelabs.net) -Received: from mizer.crimelabs.net (mizer.crimelabs.net [192.168.88.10]) - by inflicted.crimelabs.net (Postfix) with ESMTP - id 1066F8787; Fri, 8 Feb 2002 15:12:08 -0500 (EST) -Date: Fri, 8 Feb 2002 15:12:00 -0500 (EST) -From: bpalmer -To: Randall Jonasz -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: <20020208142932.H6545-100000@nietzsche.trueimpact.net> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -I've not looked at the first paper, but I wil. - -> Postgres-R is another interesting approach which I think should be taken -> seriously. Anyone interested can read a paper on this at -> http://citeseer.nj.nec.com/330257.html - -I would point you to the info on gborg, but it seems to be down at the -moment. - -- Brandon - ----------------------------------------------------------------------------- - c: 646-456-5455 h: 201-798-4983 - b. 
palmer, bpalmer@crimelabs.net pgp:crimelabs.net/bpalmer.pgp5 - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M18666=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 17:41:03 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g18Mf2P18046 - for ; Fri, 8 Feb 2002 17:41:03 -0500 (EST) -Received: (qmail 63057 invoked by alias); 8 Feb 2002 22:41:02 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 8 Feb 2002 22:41:02 -0000 -Received: from lakemtao03.mgt.cox.net (mtao3.east.cox.net [68.1.17.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g18MR9E60361 - for ; Fri, 8 Feb 2002 17:27:11 -0500 (EST) - (envelope-from darren.johnson@cox.net) -Received: from cox.net ([68.10.181.230]) by lakemtao03.mgt.cox.net - (InterMail vM.5.01.04.05 201-253-122-122-105-20011231) with ESMTP - id <20020208222634.GTRG6710.lakemtao03.mgt.cox.net@cox.net>; - Fri, 8 Feb 2002 17:26:34 -0500 -Message-ID: <3C643F0F.70303@cox.net> -Date: Fri, 08 Feb 2002 16:11:43 -0500 -From: Darren Johnson -User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; m18) Gecko/20010131 Netscape6/6.01 -X-Accept-Language: en -MIME-Version: 1.0 -To: Randall Jonasz -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -References: <20020208142932.H6545-100000@nietzsche.trueimpact.net> -Content-Type: text/plain; charset=us-ascii; format=flowed -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - - -> I've been looking into database replication theory lately and have found -> some interesting papers discussing various approaches. 
(Here's -> one paper that struck me as being very helpful, -> http://citeseer.nj.nec.com/460405.html ) - - -Here is another one from that same group, that addresses the WAN issues. - -> http://www.cnds.jhu.edu/pub/papers/cnds-2002-1.pdf - - -enjoy, - -Darren - - - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - -From pgsql-hackers-owner+M18674=candle.pha.pa.us=pgman@postgresql.org Fri Feb 8 19:20:30 2002 -Return-path: -Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9]) - by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g190KTP26980 - for ; Fri, 8 Feb 2002 19:20:29 -0500 (EST) -Received: (qmail 88124 invoked by alias); 9 Feb 2002 00:20:27 -0000 -Received: from unknown (HELO postgresql.org) (64.49.215.8) - by www.postgresql.org with SMTP; 9 Feb 2002 00:20:27 -0000 -Received: from localhost.localdomain (bgp01077650bgs.wanarb01.mi.comcast.net [68.40.135.112]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id g190H3E87489 - for ; Fri, 8 Feb 2002 19:17:03 -0500 (EST) - (envelope-from camber@ais.org) -Received: from localhost (camber@localhost) - by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g190H0P18427; - Fri, 8 Feb 2002 19:17:00 -0500 -X-Authentication-Warning: localhost.localdomain: camber owned process doing -bs -Date: Fri, 8 Feb 2002 19:17:00 -0500 (EST) -From: Brian Bruns -X-X-Sender: -To: Randall Jonasz -cc: PostgreSQL-development -Subject: Re: [HACKERS] Replication -In-Reply-To: <20020208142932.H6545-100000@nietzsche.trueimpact.net> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -> > I have in the back of my mind, an idea of patching into the WAL stuff, and -> > using that mechanism to push changes out to the slaves. -> > -> > Where one machine is still the master, but no trigger stuff, just a WAL patch. 
-
-> > Perhaps some shared memory paradigm to manage WAL visibility? I'm not sure
-> > exactly, the idea hasn't completely formed yet.
-> >
-
-FWIW, Sybase Replication Server does just such a thing.
-
-They have a secondary log marker (it prevents the log from truncating
-past the oldest unreplicated transaction). A thread within the system
-called the "rep agent" (it used to be a separate process called the LTM)
-reads the log and forwards it to the rep server. Once the rep server has
-the whole transaction and it is written to a stable device (i.e., synced
-to disk), the rep server responds to the LTM, telling it that it's OK to
-move the log marker forward.
-
-Anyway, once the replication server proper has the transaction, it uses
-a publish/subscribe methodology to see who wants to get the update.
-
-Bidirectional replication is done by making two one-way replications.
-The whole thing is table-based: it marks the tables as replicated or not
-in the database to save the trip to the rep server on unreplicated
-tables.
-
-Plus you can take parts of a database (replicate all rows where the
-country is "us" to this server and all the rows with "uk" to that
-server). Or, the opposite: you can roll up smaller regional databases to
-bigger ones. It's very flexible.
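The secondary-log-marker mechanism described above can be sketched as follows. This is a hypothetical illustration of the general idea, not Sybase's actual implementation; `TxnLog`, `RepAgent`, and every other name here are invented.

```python
# Sketch: a log whose truncation is clamped at the oldest unreplicated
# transaction. The marker only advances after the rep server has the
# entry safely on stable storage, so no unreplicated work can be lost.

class TxnLog:
    def __init__(self):
        self.entries = []          # (lsn, txn) pairs, oldest first
        self.rep_marker = 0        # oldest LSN not yet acked as replicated
        self.next_lsn = 0

    def append(self, txn):
        self.entries.append((self.next_lsn, txn))
        self.next_lsn += 1

    def truncate(self, up_to_lsn):
        # Never truncate past the secondary (replication) marker.
        limit = min(up_to_lsn, self.rep_marker)
        self.entries = [(l, t) for (l, t) in self.entries if l >= limit]

class RepAgent:
    """Reads the log past the marker and forwards it to the rep server."""
    def __init__(self, log, stable_store):
        self.log = log
        self.stable_store = stable_store

    def run_once(self):
        for lsn, txn in self.log.entries:
            if lsn < self.log.rep_marker:
                continue
            self.stable_store.append((lsn, txn))   # "synced to disk"
            self.log.rep_marker = lsn + 1          # ack moves the marker

log, shipped = TxnLog(), []
for t in ("t0", "t1", "t2"):
    log.append(t)

log.truncate(up_to_lsn=3)          # nothing replicated yet: nothing goes
assert len(log.entries) == 3

RepAgent(log, shipped).run_once()  # replicate, then truncation succeeds
log.truncate(up_to_lsn=3)
assert shipped == [(0, "t0"), (1, "t1"), (2, "t2")]
assert log.entries == []
```
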
- - -Cheers, - -Brian - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - diff --git a/doc/TODO.detail/typeconv b/doc/TODO.detail/typeconv deleted file mode 100644 index 52cc1ff00d..0000000000 --- a/doc/TODO.detail/typeconv +++ /dev/null @@ -1,916 +0,0 @@ -From pgsql-hackers-owner+M1833@hub.org Sat May 13 22:49:26 2000 -Received: from news.tht.net (news.hub.org [216.126.91.242]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA07394 - for ; Sat, 13 May 2000 22:49:24 -0400 (EDT) -Received: from hub.org (majordom@hub.org [216.126.84.1]) - by news.tht.net (8.9.3/8.9.3) with ESMTP id WAB99859; - Sat, 13 May 2000 22:44:15 -0400 (EDT) - (envelope-from pgsql-hackers-owner+M1833@hub.org) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by hub.org (8.9.3/8.9.3) with ESMTP id WAA51058 - for ; Sat, 13 May 2000 22:41:16 -0400 (EDT) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id WAA18343 - for ; Sat, 13 May 2000 22:40:38 -0400 (EDT) -To: pgsql-hackers@postgresql.org -Subject: [HACKERS] Proposal for fixing numeric type-resolution issues -Date: Sat, 13 May 2000 22:40:38 -0400 -Message-ID: <18340.958272038@sss.pgh.pa.us> -From: Tom Lane -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ORr - -We've got a collection of problems that are related to the parser's -inability to make good type-resolution choices for numeric constants. -In some cases you get a hard error; for example "NumericVar + 4.4" -yields -ERROR: Unable to identify an operator '+' for types 'numeric' and 'float8' - You will have to retype this query using an explicit cast -because "4.4" is initially typed as float8 and the system can't figure -out whether to use numeric or float8 addition. A more subtle problem -is that a query like "... 
WHERE Int2Var < 42" is unable to make use of -an index on the int2 column: 42 is resolved as int4, so the operator -is int24lt, which works but is not in the opclass of an int2 index. - -Here is a proposal for fixing these problems. I think we could get this -done for 7.1 if people like it. - -The basic problem is that there's not enough smarts in the type resolver -about the interrelationships of the numeric datatypes. All it has is -a concept of a most-preferred type within the category of numeric types. -(We are abusing the most-preferred-type mechanism, BTW, because both -FLOAT8 and NUMERIC claim to be the most-preferred type in the numeric -category! This is in fact why the resolver can't make a choice for -"numeric+float8".) We need more intelligence than that. - -I propose that we set up a strictly-ordered hierarchy of numeric -datatypes, running from least preferred to most preferred: - int2, int4, int8, numeric, float4, float8. -Rather than simply considering coercions to the most-preferred type, -the type resolver should use the following rules: - -1. No value will be down-converted (eg int4 to int2) except by an -explicit conversion. - -2. If there is not an exact matching operator, numeric values will be -up-converted to the highest numeric datatype present among the operator -or function's arguments. For example, given "int2 + int8" we'd up- -convert the int2 to int8 and apply int8 addition. - -The final piece of the puzzle is that the type initially assigned to -an undecorated numeric constant should be NUMERIC if it contains a -decimal point or exponent, and otherwise the smallest of int2, int4, -int8, NUMERIC that will represent it. This is a considerable change -from the current lexer behavior, where you get either int4 or float8. - -For example, given "NumericVar + 4.4", the constant 4.4 will initially -be assigned type NUMERIC, we will resolve the operator as numeric plus, -and everything's fine. 
Given "Float8Var + 4.4", the constant is still -initially numeric, but will be up-converted to float8 so that float8 -addition can be used. The end result is the same as in traditional -Postgres: you get float8 addition. Given "Int2Var < 42", the constant -is initially typed as int2, since it fits, and we end up selecting -int2lt, thereby allowing use of an int2 index. (On the other hand, -given "Int2Var < 100000", we'd end up using int4lt, which is correct -to avoid overflow.) - -A couple of crucial subtleties here: - -1. We are assuming that the parser or optimizer will constant-fold -any conversion functions that are introduced. Thus, in the -"Float8Var + 4.4" case, the 4.4 is represented as a float8 4.4 by the -time execution begins, so there's no performance loss. - -2. We cannot lose precision by initially representing a constant as -numeric and later converting it to float. Nor can we exceed NUMERIC's -range (the default 1000-digit limit is more than the range of IEEE -float8 data). It would not work as well to start out by representing -a constant as float and then converting it to numeric. - -Presently, the pg_proc and pg_operator tables contain a pretty fair -collection of cross-datatype numeric operators, such as int24lt, -float48pl, etc. We could perhaps leave these in, but I believe that -it is better to remove them. For example, if int42lt is left in place, -then it would capture cases like "Int4Var < 42", whereas we need that -to be translated to int4lt so that an int4 index can be used. Removing -these operators will eliminate some code bloat and system-catalog bloat -to boot. - -As far as I can tell, this proposal is almost compatible with the rules -given in SQL92: in particular, SQL92 specifies that an operator having -both "approximate numeric" (float) and "exact numeric" (int or numeric) -inputs should deliver an approximate-numeric result. 
I propose -deviating from SQL92 in a single respect: SQL92 specifies that a -constant containing an exponent (eg 1.2E34) is approximate numeric, -which implies that the result of an operator using it is approximate -even if the other operand is exact. I believe it's better to treat -such a constant as exact (ie, type NUMERIC) and only convert it to -float if the other operand is float. Without doing that, an assignment -like - UPDATE tab SET NumericVar = 1.234567890123456789012345E34; -will not work as desired because the constant will be prematurely -coerced to float, causing precision loss. - -Comments? - - regards, tom lane - -From tgl@sss.pgh.pa.us Sun May 14 17:30:56 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA05808 - for ; Sun, 14 May 2000 17:30:52 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) by renoir.op.net (o1/$Revision: 1.4 $) with ESMTP id RAA16657 for ; Sun, 14 May 2000 17:29:52 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id RAA20914; - Sun, 14 May 2000 17:29:30 -0400 (EDT) -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: [HACKERS] type conversion discussion -In-reply-to: <200005141950.PAA04636@candle.pha.pa.us> -References: <200005141950.PAA04636@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Sun, 14 May 2000 15:50:20 -0400" -Date: Sun, 14 May 2000 17:29:30 -0400 -Message-ID: <20911.958339770@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -Bruce Momjian writes: -> As some point, it seems we need to get all the PostgreSQL minds together -> to discuss type conversion issues. These problems continue to come up -> from release to release. We are getting better, but it seems a full -> discussion could help solidify our strategy. - -OK, here are a few things that bug me about the current type-resolution -code: - -1. 
Poor choice of type to attribute to numeric literals. (A possible - solution is sketched in my earlier message, but do we need similar - mechanisms for other type categories?) - -2. Tensions between treating string literals as "unknown" type and - as "text" type, per this thread so far. - -3. IS_BINARY_COMPATIBLE seems like a bogus concept. Do we really want a - fully symmetrical ring of types in each group? I'd prefer to see a - one-way equivalence, which allows eg. OID to be silently converted - to INT4, but *not* vice versa (except perhaps by specific user cast). - This'd be more like a traditional "is-a" or inheritance relationship - between datatypes, which has well-understood semantics. - -4. I'm also concerned that the behavior of IS_BINARY_COMPATIBLE isn't - very predictable because it will happily go either way. For example, - if I do - select * from pg_class where oid = 1234; - it's unclear whether I will get an oideq or an int4eq operator --- - and that's a rather critical point since only one of them can exploit - an index on the oid column. Currently, there is some klugery in the - planner that works around this by overriding the parser's choice of - operator to substitute one that is compatible with an available index. - That's a pretty ugly solution ... I'm not sure I know a better one, - but as long as we're discussing type resolution issues ... - -5. Lack of extensibility. There's way too much knowledge hard-wired - into the parser about type categories, preferred types, binary - compatibility, etc. All of it falls down when faced with - user-defined datatypes. If we do something like I suggested with - a hardwired hierarchy of numeric datatypes, it'll get even worse. - All this stuff ought to be driven off fields in pg_type rather than - be hardwired into the code, so that the same concepts can be extended - to user-defined types. - -I don't have worked-out proposals for any of these but the first, -but they've all been bothering me for a while. 
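The proposed fix for issue 1 (from the earlier numeric-types message) amounts to a small typing rule for undecorated literals: a constant with a decimal point or exponent starts life as NUMERIC, and an integer constant gets the smallest of int2/int4/int8/NUMERIC that can represent it. A sketch, with an invented function name:

```python
# Sketch of the proposed literal-typing rule; initial_numeric_type is a
# hypothetical name, not a PostgreSQL function.

def initial_numeric_type(literal: str) -> str:
    if "." in literal or "e" in literal.lower():
        return "numeric"       # exact, even with an exponent
    n = int(literal)
    if -2**15 <= n < 2**15:
        return "int2"
    if -2**31 <= n < 2**31:
        return "int4"
    if -2**63 <= n < 2**63:
        return "int8"
    return "numeric"

assert initial_numeric_type("42") == "int2"       # "Int2Var < 42" -> int2lt
assert initial_numeric_type("100000") == "int4"   # too big for int2
assert initial_numeric_type("4.4") == "numeric"
assert initial_numeric_type("1.2E34") == "numeric"  # exact, per the proposal
```
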
- - regards, tom lane - -From tgl@sss.pgh.pa.us Sun May 14 21:02:31 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA07700 - for ; Sun, 14 May 2000 21:02:28 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id VAA21261; - Sun, 14 May 2000 21:03:17 -0400 (EDT) -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: [HACKERS] type conversion discussion -In-reply-to: <20911.958339770@sss.pgh.pa.us> -References: <200005141950.PAA04636@candle.pha.pa.us> <20911.958339770@sss.pgh.pa.us> -Comments: In-reply-to Tom Lane - message dated "Sun, 14 May 2000 17:29:30 -0400" -Date: Sun, 14 May 2000 21:03:17 -0400 -Message-ID: <21258.958352597@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -Here are the results of some further thoughts about type-conversion -issues. This is not a complete proposal yet, but a sketch of an -approach that might solve several of the gripes in my previous proposal. - -While thinking about this, I realized that my numeric-types proposal -of yesterday would break at least a few cases that work nicely now. -For example, I frequently do things like - select * from pg_class where oid = 1234; -whilst poking around in system tables and querytree dumps. If that -constant is initially resolved as int2, as I suggested yesterday, -then we have "oid = int2" for which there is no operator. To succeed -we must decide to promote the constant to int4 --- but with no int4 -visible among the operands of the "=", it will not work to just "promote -numerics to the highest type seen in the operands" as I suggested -yesterday. So there has to be some more interaction in there. - -Anyway, I was complaining about the looseness of the concept of -binary-compatible types and the fact that the parser's type conversion -knowledge is mostly hardwired. 
These might be resolved by generalizing -the numeric type hierarchy idea into a "type promotion lattice", which -would work like this: - -* Add a "typpromote" column to pg_type, which contains either zero or - the OID of another type that the parser is allowed to promote this - type to when searching for usable functions/operators. For example, - my numeric-types hierarchy of yesterday would be expressed by making - int2 promote to int4, int4 to int8, int8 to numeric, numeric to - float4, and float4 to float8. The promotion idea also replaces the - current concept of binary-compatible types: for example, OID would - link to int4 and varchar would link to text (but not vice versa!). - -* Also add a "typpromotebin" boolean column to pg_type, which contains - 't' if the type conversion indicated by typpromote is "free", ie, - no conversion function need be executed before regarding a value as - belonging to the promoted type. This distinguishes binary-compatible - from non-binary-compatible cases. If "typpromotebin" is 'f' and the - parser decides it needs to apply the conversion, then it has to look - up the appropriate conversion function in pg_proc. (More about this - below.) - -Now, if the parser fails to find an exact match for a given function -or operator name and the exact set of input data types, it proceeds by -chasing up the promotion chains for the input data types and trying to -locate a set of types for which there is a matching function/operator. -If there are multiple possibilities, we choose the one which is the -"least promoted" by some yet-to-be-determined metric. (This metric -would probably favor "free" conversions over non-free ones, but other -than that I'm not quite sure how it should work. The metric would -replace a whole bunch of ad-hoc heuristics that are currently applied -in the type resolver, so even if it seems rather ad-hoc it'd still be -cleaner than what we have ;-).) 
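The chain-walking resolution sketched above might look like this in outline. All catalogs and names here are invented for illustration, and the "metric" is simplified to fewest total promotion steps — the real metric would also have to weigh free (binary-compatible) conversions against non-free ones.

```python
# Sketch: resolve an operator by walking both operands' typpromote chains
# and picking the least-promoted pair for which an operator exists.

TYPPROMOTE = {                       # hypothetical typpromote column
    "int2": "int4", "int4": "int8", "int8": "numeric",
    "numeric": "float4", "float4": "float8",
    "oid": "int4", "varchar": "text",
}

# Operators that exist, keyed by (left, right) argument types.
OPERATORS = {("int4", "int4"), ("int8", "int8"), ("numeric", "numeric"),
             ("float8", "float8"), ("text", "text")}

def promotion_chain(t):
    chain = [t]
    while t in TYPPROMOTE:
        t = TYPPROMOTE[t]
        chain.append(t)
    return chain

def resolve(left, right):
    best = None
    for i, lt in enumerate(promotion_chain(left)):
        for j, rt in enumerate(promotion_chain(right)):
            if (lt, rt) in OPERATORS and (best is None or i + j < best[0]):
                best = (i + j, lt, rt)   # fewest total promotion steps
    if best is None:
        raise LookupError(f"no operator for {left} = {right}")
    return best[1], best[2]

# "oid = int2": one step on each side lands on int4 = int4.
assert resolve("oid", "int2") == ("int4", "int4")
# "int2 + int8": promote the int2 operand up to int8.
assert resolve("int2", "int8") == ("int8", "int8")
# No promotion path from int2 to text: resolution fails cleanly.
try:
    resolve("int2", "text")
    assert False
except LookupError:
    pass
```
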
- -In a situation like the "oid = int2" example above, this mechanism would -presumably settle on "int4 = int4" as being the least-promoted -equivalent operator. (It could not find "oid = oid" since there is -no promotion path from int2 to oid.) That looks bad since it isn't -compatible with an oidops index --- but I have a solution for that! -I don't think we need the oid opclass at all; why shouldn't indexes -on oid be expressed as int4 indexes to begin with? In general, if -two types are considered binary-equivalent under the old scheme, then -the one that is considered the subtype probably shouldn't have separate -index operators under this new scheme. Instead it should just rely on -the index operators of the promoted type. - -The point of the proposed typpromotebin field is to save a pg_proc -lookup when trying to determine whether a particular promotion is "free" -or not. We could save even more lookups if we didn't store the boolean -but instead the actual OID of the conversion function, or zero if the -promotion is "free". The trouble with that is that it creates a -circularity problem when trying to define a new user type --- you can't -define the conversion function if its input type doesn't exist yet. -In any case, we want the parser to do a function lookup if we've -advanced more than one step in the promotion hierarchy: if we've decided -to promote int4 to float8 (which will be a four-step chain through int8, -numeric, float4) we sure want the thing to use a direct int4tofloat8 -conversion function if available, not a chain of four conversion -functions. So on balance I think we want to look in pg_proc once we've -decided which conversion to perform. The only reason for having -typpromotebin is that the promotion metric will want to know which -conversions are free, and we don't want to have to do a lookup in -pg_proc for each alternative we consider, only the ones that are finally -selected to be used. 
- -I can think of at least one special case that still isn't cleanly -handled under this scheme, and that is bpchar vs. varchar comparison. -Currently, we have - -regression=# select 'a'::bpchar = 'a '::bpchar; - ?column? ----------- - t -(1 row) - -This is correct since trailing blanks are insignificant in bpchar land, -so the two values should be considered equal. If we try - -regression=# select 'a'::bpchar = 'a '::varchar; -ERROR: Unable to identify an operator '=' for types 'bpchar' and 'varchar' - You will have to retype this query using an explicit cast - -which is pretty bogus but at least it saves the system from making some -random choice about whether bpchar or varchar comparison rules apply. -On the other hand, - -regression=# select 'a'::bpchar = 'a '::text; - ?column? ----------- - f -(1 row) - -Here the bpchar value has been promoted to text and then text comparison -(where trailing blanks *are* significant) is applied. I'm not sure that -we can really justify doing this in this case when we reject the bpchar -vs varchar case, but maybe someone wants to argue that that's correct. - -The natural setup in my type-promotion scheme would be that both bpchar -and varchar link to 'text' as their promoted type. If we do nothing -special then text-style comparison would be used in a bpchar vs varchar -comparison, which is arguably wrong. - -One way to deal with this without introducing kluges into the type -resolver is to provide a full set of bpchar vs text and text vs bpchar -operators, and make sure that the promotion metric is such that these -will be used in place of text vs text operators if they apply (which -should hold, I think, for any reasonable metric). This is probably -the only way to get the "right" behavior in any case --- I think that -the "right" behavior for such comparisons is to strip trailing blanks -from the bpchar side but not the text/varchar side. (I haven't checked -to see if SQL92 agrees, though.) 
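The three comparison behaviors discussed above can be sketched directly. These are hypothetical helper functions illustrating the semantics, not the actual PostgreSQL operator code; the mixed operator implements the "strip trailing blanks from the bpchar side but not the text side" rule proposed as the right behavior.

```python
# Sketch of bpchar/text equality semantics.

def bpchar_eq(a, b):              # trailing blanks insignificant, both sides
    return a.rstrip(" ") == b.rstrip(" ")

def text_eq(a, b):                # trailing blanks significant
    return a == b

def bpchar_text_eq(bp, txt):      # proposed mixed bpchar-vs-text operator
    return bp.rstrip(" ") == txt

assert bpchar_eq("a", "a ")            # 'a'::bpchar = 'a '::bpchar -> t
assert not text_eq("a", "a ")          # what promotion to text gives today
assert bpchar_text_eq("a ", "a")       # blanks stripped on the bpchar side
assert not bpchar_text_eq("a", "a ")   # but kept on the text side
```
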
- -Another issue is how to fit resolution of "unknown" literals into this -scheme. We could probably continue to handle them more or less as we -do now, but they might complicate the promotion metric. - -I am not clear yet on whether we'd still need the concept of "type -categories" as they presently exist in the resolver. It's possible -that we wouldn't, which would be a nice simplification. (If we do -still need them, we should have a column in pg_type that defines the -category of a type, instead of hard-wiring category assignments.) - - regards, tom lane - -From e99re41@DoCS.UU.SE Mon May 15 07:39:03 2000 -Received: from meryl.it.uu.se (root@meryl.it.uu.se [130.238.12.42]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id HAA10251 - for ; Mon, 15 May 2000 07:39:01 -0400 (EDT) -Received: from Zebra.DoCS.UU.SE (e99re41@Zebra.DoCS.UU.SE [130.238.9.158]) - by meryl.it.uu.se (8.8.5/8.8.5) with ESMTP id NAA10849; - Mon, 15 May 2000 13:39:45 +0200 (MET DST) -Received: from localhost (e99re41@localhost) by Zebra.DoCS.UU.SE (8.6.12/8.6.12) with ESMTP id NAA26523; Mon, 15 May 2000 13:39:44 +0200 -X-Authentication-Warning: Zebra.DoCS.UU.SE: e99re41 owned process doing -bs -Date: Mon, 15 May 2000 13:39:44 +0200 (MET DST) -From: Peter Eisentraut -Reply-To: Peter Eisentraut -To: Tom Lane -cc: Bruce Momjian , - PostgreSQL-development -Subject: Re: [HACKERS] type conversion discussion -In-Reply-To: <20911.958339770@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=iso-8859-1 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by candle.pha.pa.us id HAA10251 -Status: OR - -On Sun, 14 May 2000, Tom Lane wrote: - -> 1. Poor choice of type to attribute to numeric literals. (A possible -> solution is sketched in my earlier message, but do we need similar -> mechanisms for other type categories?) - -I think your plan looks good for the numerical land. (I'll ponder the oid -issues in a second.) 
For other type categories, perhaps not. Should a line -be promoted to a polygon so you can check if it contains a point? Or a -polygon to a box? Higher dimensions? :-) - - -> 2. Tensions between treating string literals as "unknown" type and -> as "text" type, per this thread so far. - -Yes, while we're at it, let's look at this in detail. I claim that -something of the form 'xxx' should always be text (or char or whatever), -period. Let's consider the cases where this could potentially clash with -the current behaviour: - -a) The target type is unambiguously clear, e.g., UPDATE ... SET. Then you -cast text to the target type. The effect is identical. - -b) The target type is completely unspecified, e.g. CREATE TABLE AS SELECT -'xxx'; This will currently create an "unknown" column. It should arguably -create a "text" column. - -Function argument resolution: - -c) There is only one function and it has a "text" argument. No-brainer. - -d) There is only one function and it has an argument other than text. Try -to cast text to that type. (This is what's done in general, isn't it?) - -e) The function is overloaded for many types, amongst which is text. Then -call the text version. I believe this would currently fail, which I'd -consider a deficiency. - -f) The function is overloaded for many types, none of which is text. In -that case you have to cast anyway, so you don't lose anything. - -One thing to also keep in mind regarding required casting for (b) and (f) -is that SQL never allowed literals of "fancy" types (e.g., DATE) to be -written as undecorated 'yyyy-mm-dd' constants; you always have to say DATE -'yyyy-mm-dd'. What Postgres allows is a convenience where DATE would be -obvious or implied. In the end it's a win-win situation: you tell the -system what you want, and your code is clearer. - - -> 3. IS_BINARY_COMPATIBLE seems like a bogus concept. - -At least it's bogus when used for types which are not actually binary -compatible, e.g. int4 and oid. 

The result of the current implementation is -that you can perfectly happily insert and retrieve negative numbers from -oid fields. - -I'm not so sure about the value of this particular equivalency anyway. -AFAICS the only functions that make sense for oids are comparisons (incl. -min, max), adding integers to them, subtracting one oid from another. -Silent mangling with int4 means that you can multiply them, square them, -add floating point numbers to them (doesn't really work in practice -though), all things that have no business with oids. - -I'd say define the operators that are useful for oids explicitly for oids -and require casts for all others, so the users know what they're doing. -The fact that an oid is also a number should be an implementation detail. - -In my mind oids are like pointers in C. Indiscriminate mangling of -pointers and integers in C has long been dismissed as questionable coding. - - -Of course I'd be very willing to consider counterexamples to these -theories ... 
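Peter's pointers-in-C analogy could be sketched as a wrapper type that supports only the operations he lists (comparison, adding an integer, subtracting one oid from another) and rejects everything else. The class and its behavior are purely illustrative:

```python
from functools import total_ordering

@total_ordering
class Oid:
    """Illustrative opaque-oid wrapper: comparisons, oid + int, and
    oid - oid are allowed; multiplication etc. are deliberately absent,
    and negative values are rejected at construction."""

    def __init__(self, value):
        if value < 0:
            raise ValueError("oids are unsigned")
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Oid) and self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    def __add__(self, n):
        if not isinstance(n, int):
            raise TypeError("can only add an integer to an oid")
        return Oid(self.value + n)

    def __sub__(self, other):
        if not isinstance(other, Oid):
            raise TypeError("can only subtract an oid from an oid")
        return self.value - other.value
```

Here Oid(10) + 5 and Oid(10) - Oid(4) work, but Oid(10) * 2 raises TypeError and Oid(-1) is rejected outright, unlike the silent int4 mangling described above.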
- --- -Peter Eisentraut Sernanders väg 10:115 -peter_e@gmx.net 75262 Uppsala -http://yi.org/peter-e/ Sweden - - -From tgl@sss.pgh.pa.us Tue Jun 13 04:58:20 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA24281 - for ; Tue, 13 Jun 2000 03:58:18 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id DAA02571; - Tue, 13 Jun 2000 03:58:43 -0400 (EDT) -To: Bruce Momjian -cc: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Proposal for fixing numeric type-resolution issues -In-reply-to: <200006130741.DAA23502@candle.pha.pa.us> -References: <200006130741.DAA23502@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Tue, 13 Jun 2000 03:41:56 -0400" -Date: Tue, 13 Jun 2000 03:58:43 -0400 -Message-ID: <2568.960883123@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -Bruce Momjian writes: -> Again, anything to add to the TODO here? - -IIRC, there was some unhappiness with the proposal you quote, so I'm -not sure we've quite agreed what to do... but clearly something must -be done. - - regards, tom lane - - ->> We've got a collection of problems that are related to the parser's ->> inability to make good type-resolution choices for numeric constants. ->> In some cases you get a hard error; for example "NumericVar + 4.4" ->> yields ->> ERROR: Unable to identify an operator '+' for types 'numeric' and 'float8' ->> You will have to retype this query using an explicit cast ->> because "4.4" is initially typed as float8 and the system can't figure ->> out whether to use numeric or float8 addition. A more subtle problem ->> is that a query like "... WHERE Int2Var < 42" is unable to make use of ->> an index on the int2 column: 42 is resolved as int4, so the operator ->> is int24lt, which works but is not in the opclass of an int2 index. ->> ->> Here is a proposal for fixing these problems. 
I think we could get this ->> done for 7.1 if people like it. ->> ->> The basic problem is that there's not enough smarts in the type resolver ->> about the interrelationships of the numeric datatypes. All it has is ->> a concept of a most-preferred type within the category of numeric types. ->> (We are abusing the most-preferred-type mechanism, BTW, because both ->> FLOAT8 and NUMERIC claim to be the most-preferred type in the numeric ->> category! This is in fact why the resolver can't make a choice for ->> "numeric+float8".) We need more intelligence than that. ->> ->> I propose that we set up a strictly-ordered hierarchy of numeric ->> datatypes, running from least preferred to most preferred: ->> int2, int4, int8, numeric, float4, float8. ->> Rather than simply considering coercions to the most-preferred type, ->> the type resolver should use the following rules: ->> ->> 1. No value will be down-converted (eg int4 to int2) except by an ->> explicit conversion. ->> ->> 2. If there is not an exact matching operator, numeric values will be ->> up-converted to the highest numeric datatype present among the operator ->> or function's arguments. For example, given "int2 + int8" we'd up- ->> convert the int2 to int8 and apply int8 addition. ->> ->> The final piece of the puzzle is that the type initially assigned to ->> an undecorated numeric constant should be NUMERIC if it contains a ->> decimal point or exponent, and otherwise the smallest of int2, int4, ->> int8, NUMERIC that will represent it. This is a considerable change ->> from the current lexer behavior, where you get either int4 or float8. ->> ->> For example, given "NumericVar + 4.4", the constant 4.4 will initially ->> be assigned type NUMERIC, we will resolve the operator as numeric plus, ->> and everything's fine. Given "Float8Var + 4.4", the constant is still ->> initially numeric, but will be up-converted to float8 so that float8 ->> addition can be used. 
The end result is the same as in traditional ->> Postgres: you get float8 addition. Given "Int2Var < 42", the constant ->> is initially typed as int2, since it fits, and we end up selecting ->> int2lt, thereby allowing use of an int2 index. (On the other hand, ->> given "Int2Var < 100000", we'd end up using int4lt, which is correct ->> to avoid overflow.) ->> ->> A couple of crucial subtleties here: ->> ->> 1. We are assuming that the parser or optimizer will constant-fold ->> any conversion functions that are introduced. Thus, in the ->> "Float8Var + 4.4" case, the 4.4 is represented as a float8 4.4 by the ->> time execution begins, so there's no performance loss. ->> ->> 2. We cannot lose precision by initially representing a constant as ->> numeric and later converting it to float. Nor can we exceed NUMERIC's ->> range (the default 1000-digit limit is more than the range of IEEE ->> float8 data). It would not work as well to start out by representing ->> a constant as float and then converting it to numeric. ->> ->> Presently, the pg_proc and pg_operator tables contain a pretty fair ->> collection of cross-datatype numeric operators, such as int24lt, ->> float48pl, etc. We could perhaps leave these in, but I believe that ->> it is better to remove them. For example, if int42lt is left in place, ->> then it would capture cases like "Int4Var < 42", whereas we need that ->> to be translated to int4lt so that an int4 index can be used. Removing ->> these operators will eliminate some code bloat and system-catalog bloat ->> to boot. ->> ->> As far as I can tell, this proposal is almost compatible with the rules ->> given in SQL92: in particular, SQL92 specifies that an operator having ->> both "approximate numeric" (float) and "exact numeric" (int or numeric) ->> inputs should deliver an approximate-numeric result. 
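The proposed rules for typing numeric constants and resolving operators can be sketched as follows; this is a model of the proposal, not actual lexer or resolver code:

```python
# Ordered hierarchy from the proposal; resolution never down-converts.
HIERARCHY = ["int2", "int4", "int8", "numeric", "float4", "float8"]

def initial_literal_type(literal):
    """Proposed lexer rule: NUMERIC if the constant contains a decimal
    point or exponent, else the smallest integer type that holds it
    (falling back to NUMERIC for very large values)."""
    if "." in literal or "e" in literal.lower():
        return "numeric"
    v = int(literal)
    for typ, bits in (("int2", 16), ("int4", 32), ("int8", 64)):
        if -(2 ** (bits - 1)) <= v < 2 ** (bits - 1):
            return typ
    return "numeric"

def resolve_numeric_op(left, right):
    """Rule 2: up-convert to the highest type among the arguments."""
    return max(left, right, key=HIERARCHY.index)
```

So 42 starts life as int2 (letting "Int2Var < 42" use int2lt), 100000 starts as int4, and 4.4 or 1.2E34 starts as numeric, up-converting to float8 only when the other operand is float.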
I propose ->> deviating from SQL92 in a single respect: SQL92 specifies that a ->> constant containing an exponent (eg 1.2E34) is approximate numeric, ->> which implies that the result of an operator using it is approximate ->> even if the other operand is exact. I believe it's better to treat ->> such a constant as exact (ie, type NUMERIC) and only convert it to ->> float if the other operand is float. Without doing that, an assignment ->> like ->> UPDATE tab SET NumericVar = 1.234567890123456789012345E34; ->> will not work as desired because the constant will be prematurely ->> coerced to float, causing precision loss. ->> ->> Comments? ->> ->> regards, tom lane ->> - - -> -- -> Bruce Momjian | http://www.op.net/~candle -> pgman@candle.pha.pa.us | (610) 853-3000 -> + If your life is a hard drive, | 830 Blythe Avenue -> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - -From tgl@sss.pgh.pa.us Mon Jun 12 14:09:45 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA01993 - for ; Mon, 12 Jun 2000 13:09:43 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id NAA01515; - Mon, 12 Jun 2000 13:10:01 -0400 (EDT) -To: Peter Eisentraut -cc: Bruce Momjian , - "Thomas G. Lockhart" , - PostgreSQL-development -Subject: Re: [HACKERS] Adding time to DATE type -In-reply-to: -References: -Comments: In-reply-to Peter Eisentraut - message dated "Sun, 11 Jun 2000 13:41:24 +0200" -Date: Mon, 12 Jun 2000 13:10:00 -0400 -Message-ID: <1512.960829800@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Peter Eisentraut writes: -> Bruce Momjian writes: ->> Can someone give me a TODO summary for this issue? - -> * make 'text' constants default to text type (not unknown) - -> (I think not everyone's completely convinced on this issue, but I don't -> recall anyone being firmly opposed to it.) 
- -It would be a mistake to eliminate the distinction between unknown and -text. See for example my just-posted response to John Cochran on -pgsql-general about why 'BOULEVARD'::text behaves differently from -'BOULEVARD'::char. If string literals are immediately assigned type -text then we will have serious problems with char(n) fields. - -I think it's fine to assign string literals a type of 'unknown' -initially. What we need to do is add a phase of type resolution that -considers treating them as text, but only after the existing logic fails -to deduce a type. - -(BTW it might be better to treat string literals as defaulting to char(n) -instead of text, allowing the normal promotion rules to replace char(n) -with text if necessary. Not sure if that would make things more or less -confusing for operations that intermix fixed- and variable-width char -types.) - - regards, tom lane - -From pgsql-hackers-owner+M1936@postgresql.org Sun Dec 10 13:17:54 2000 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA20676 - for ; Sun, 10 Dec 2000 13:17:54 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id eBAIGvZ40566; - Sun, 10 Dec 2000 13:16:57 -0500 (EST) - (envelope-from pgsql-hackers-owner+M1936@postgresql.org) -Received: from sss.pgh.pa.us (sss.pgh.pa.us [209.114.132.154]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id eBAI8HZ39820 - for ; Sun, 10 Dec 2000 13:08:17 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.1/8.11.1) with ESMTP id eBAI82o28682; - Sun, 10 Dec 2000 13:08:02 -0500 (EST) -To: Thomas Lockhart -cc: pgsql-hackers@postgresql.org -Subject: [HACKERS] Unknown-type resolution rules, redux -Date: Sun, 10 Dec 2000 13:08:02 -0500 -Message-ID: <28679.976471682@sss.pgh.pa.us> -From: Tom Lane -Precedence: 
bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -parse_coerce.c contains the following conversation --- I believe the -first XXX comment is from me and the second from you: - - /* - * Still too many candidates? Try assigning types for the unknown - * columns. - * - * We do this by examining each unknown argument position to see if all - * the candidates agree on the type category of that slot. If so, and - * if some candidates accept the preferred type in that category, - * eliminate the candidates with other input types. If we are down to - * one candidate at the end, we win. - * - * XXX It's kinda bogus to do this left-to-right, isn't it? If we - * eliminate some candidates because they are non-preferred at the - * first slot, we won't notice that they didn't have the same type - * category for a later slot. - * XXX Hmm. How else would you do this? These candidates are here because - * they all have the same number of matches on arguments with explicit - * types, so from here on left-to-right resolution is as good as any. - * Need a counterexample to see otherwise... - */ - -The comment is out of date anyway because it fails to mention the new -rule about preferring STRING category. But to answer your request for -a counterexample: consider - - SELECT foo('bar', 'baz') - -First, suppose the available candidates are - - foo(float8, int4) - foo(float8, point) - -In this case, we examine the first argument position, see that all the -candidates agree on NUMERIC category, so we consider resolving the first -unknown input to float8. That eliminates neither candidate so we move -on to the second argument position. Here there is a conflict of -categories so we can't eliminate anything, and we decide the call is -ambiguous. That's correct (or at least Operating As Designed ;-)). 
- -But now suppose we have - - foo(float8, int4) - foo(float4, point) - -Here, at the first position we will still see that all candidates agree -on NUMERIC category, and then we will eliminate candidate 2 because it -isn't the preferred type in that category. Now when we come to the -second argument position, there's only one candidate left so there's -no category conflict. Result: this call is considered non-ambiguous. - -This means there is a left-to-right bias in the algorithm. For example, -the exact same call *would* be considered ambiguous if the candidates' -argument orders were reversed: - - foo(int4, float8) - foo(point, float4) - -I do not like that. You could maybe argue that earlier arguments are -more important than later ones for functions, but it's harder to make -that case for binary operators --- and in any case this behavior is -extremely difficult to explain in prose. - -To fix this, I think we need to split the loop into two passes. -The first pass does *not* remove any candidates. What it does is to -look separately at each UNKNOWN-argument position and attempt to deduce -a probable category for it, using the following rules: - -* If any candidate has an input type of STRING category, use STRING -category; else if all candidates agree on the category, use that -category; else fail because no resolution can be made. - -* The first pass must also remember whether any candidates are of a -preferred type within the selected category. - -The probable categories and exists-preferred-type booleans are saved in -local arrays. (Note this has to be done this way because -IsPreferredType currently allows more than one type to be considered -preferred in a category ... so the first pass cannot try to determine a -unique type, only a category.) - -If we find a category for every UNKNOWN arg, then we enter a second loop -in which we discard candidates. 
In this pass we discard a candidate if -(a) it is of the wrong category, or (b) it is of the right category but -is not of preferred type in that category, *and* we found candidate(s) -of preferred type at this slot. - -If we end with exactly one candidate then we win. - -It is clear in this algorithm that there is no order dependency: the -conditions for keeping or discarding a candidate are fixed before we -start the second pass, and do not vary depending on which other -candidates were discarded before it. - -Comments? - - regards, tom lane - -From pgsql-general-owner+M18949=candle.pha.pa.us=pgman@postgresql.org Sat Dec 29 15:47:47 2001 -Return-path: -Received: from rs.postgresql.org (server1.pgsql.org [64.39.15.238] (may be forged)) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id fBTKlkT05111 - for ; Sat, 29 Dec 2001 15:47:46 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by rs.postgresql.org (8.11.6/8.11.6) with ESMTP id fBTKhZN74322 - for ; Sat, 29 Dec 2001 14:43:35 -0600 (CST) - (envelope-from pgsql-general-owner+M18949=candle.pha.pa.us=pgman@postgresql.org) -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id fBTKaem38452 - for ; Sat, 29 Dec 2001 15:36:40 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id fBTKaTg04256; - Sat, 29 Dec 2001 15:36:29 -0500 (EST) -From: Bruce Momjian -Message-ID: <200112292036.fBTKaTg04256@candle.pha.pa.us> -Subject: Re: [GENERAL] Casting Varchar to Numeric -In-Reply-To: <20011206150158.O28880-100000@megazone23.bigpanda.com> -To: Stephan Szabo -Date: Sat, 29 Dec 2001 15:36:29 -0500 (EST) -cc: Andy Marden , pgsql-general@postgresql.org -X-Mailer: ELM [version 2.4ME+ PL96 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org 
-Status: OR - -> On Mon, 3 Dec 2001, Andy Marden wrote: -> -> > Martijn, -> > -> > It does work (believe it or not). I've now tried the method you mention -> > below - that also works and is much nicer. I can't believe that PostgreSQL -> > can't work this out. Surely implementing an algorithm that understands that -> > if you can go from a ->b and b->c then you can certainly go from a->c. If -> -> It's more complicated than that (and postgres does some of this but not -> all), for example the cast text->float8->numeric potentially loses -> precision and should probably not be an automatic cast for that reason. -> -> > this is viewed as too complex a task for the internals - at least a diagram -> > or some way of understanding how you should go from a->c would be immensely -> > helpful wouldn't it! Daunting for anyone picking up the database and trying -> > to do something simple(!) -> -> There may be a need for documentation on this. Would you like to write -> some ;) - -OK, I ran some tests: - - test=> create table test (x text); - CREATE - test=> insert into test values ('323'); - INSERT 5122745 1 - test=> select cast (x as numeric) from test; - ERROR: Cannot cast type 'text' to 'numeric' - -I can see problems with automatically casting numeric to text because -you have to guess the desired format, but going from text to numeric -seems quite easy to do. Is there a reason we don't do it? - -I can cast to integer and float8 fine: - - test=> select cast ( x as integer) from test; - ?column? - ---------- - 323 - (1 row) - - test=> select cast ( x as float8) from test; - ?column? - ---------- - 323 - (1 row) - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-general-owner+M18951=candle.pha.pa.us=pgman@postgresql.org Sat Dec 29 19:10:38 2001 -Return-path: -Received: from west.navpoint.com (west.navpoint.com [207.106.42.13]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id fBU0AbT23972 - for ; Sat, 29 Dec 2001 19:10:37 -0500 (EST) -Received: from rs.postgresql.org (server1.pgsql.org [64.39.15.238] (may be forged)) - by west.navpoint.com (8.11.6/8.10.1) with ESMTP id fBTNVj008959 - for ; Sat, 29 Dec 2001 18:31:45 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by rs.postgresql.org (8.11.6/8.11.6) with ESMTP id fBTNQrN78655 - for ; Sat, 29 Dec 2001 17:26:53 -0600 (CST) - (envelope-from pgsql-general-owner+M18951=candle.pha.pa.us=pgman@postgresql.org) -Received: from sss.pgh.pa.us ([192.204.191.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id fBTN8Fm47978 - for ; Sat, 29 Dec 2001 18:08:15 -0500 (EST) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id fBTN7vg20245; - Sat, 29 Dec 2001 18:07:57 -0500 (EST) -To: Bruce Momjian -cc: Stephan Szabo , - Andy Marden , pgsql-general@postgresql.org -Subject: Re: [GENERAL] Casting Varchar to Numeric -In-Reply-To: <200112292036.fBTKaTg04256@candle.pha.pa.us> -References: <200112292036.fBTKaTg04256@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Sat, 29 Dec 2001 15:36:29 -0500" -Date: Sat, 29 Dec 2001 18:07:57 -0500 -Message-ID: <20242.1009667277@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -Bruce Momjian writes: -> I can see problems with automatically casting numeric to text because -> you have to guess the 
desired format, but going from text to numeric -> seems quite easy to do. Is there a reason we don't do it? - -I do not think it's a good idea to have implicit casts between text and -everything under the sun, because that essentially destroys the type -checking system. What we need (see previous discussion) is a flag in -pg_proc that says whether a type conversion function may be invoked -implicitly or not. I've got no problem with offering text(numeric) and -numeric(text) functions that are invoked by explicit function calls or -casts --- I just don't want the system trying to use them to make -sense of a bogus query. - -> I can cast to integer and float8 fine: - -I don't believe that those should be available as implicit casts either. -They are, at the moment: - -regression=# select 33 || 44.0; - ?column? ----------- - 3344 -(1 row) - -Ugh. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - diff --git a/doc/TODO.detail/vacuum b/doc/TODO.detail/vacuum deleted file mode 100644 index 58839ac86c..0000000000 --- a/doc/TODO.detail/vacuum +++ /dev/null @@ -1,2126 +0,0 @@ -From Inoue@tpf.co.jp Tue Jan 18 19:08:30 2000 -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA10148 - for ; Tue, 18 Jan 2000 20:08:27 -0500 (EST) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id KAA02790; Wed, 19 Jan 2000 10:08:02 +0900 -From: "Hiroshi Inoue" -To: "Bruce Momjian" -Cc: "pgsql-hackers" -Subject: RE: [HACKERS] Index recreation in vacuum -Date: Wed, 19 Jan 2000 10:13:40 +0900 -Message-ID: <000201bf621a$6b9baf20$2801007e@tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 
4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -Importance: Normal -In-Reply-To: <200001181821.NAA02988@candle.pha.pa.us> -Status: ROr - -> -----Original Message----- -> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -> -> [Charset iso-8859-1 unsupported, filtering to ASCII...] -> > Hi all, -> > -> > I'm trying to implement REINDEX command. -> > -> > REINDEX operation itself is available everywhere and -> > I've thought about applying it to VACUUM. -> -> That is a good idea. Vacuuming of indexes can be very slow. -> -> > . -> > My plan is as follows. -> > -> > Add a new option to force index recreation in vacuum -> > and if index recreation is specified. -> -> Couldn't we auto-recreate indexes based on the number of tuples moved by -> vacuum, - -Yes,we could probably do it. But I'm not sure the availability of new -vacuum. - -New vacuum would give us a big advantage that -1) Much faster than current if vacuum remove/moves many tuples. -2) Does shrink index files - -But in case of abort/crash -1) couldn't choose index scan for the table -2) unique constraints of the table would be lost - -I don't know how people estimate this disadvantage. - -> -> > Now I'm inclined to use relhasindex of pg_class to -> > validate/invalidate indexes of a table at once. -> -> There are a few calls to CatalogIndexInsert() that know the -> system table they -> are using and know it has indexes, so it does not check that field. You -> could add cases for that. -> - -I think there aren't so many places to check. -I would examine it if my idea is OK. - -Regards. 
- -Hiroshi Inoue -Inoue@tpf.co.jp - -From owner-pgsql-hackers@hub.org Tue Jan 18 19:15:27 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA10454 - for ; Tue, 18 Jan 2000 20:15:26 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id UAA42280; - Tue, 18 Jan 2000 20:10:35 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 20:10:30 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id UAA42081 - for pgsql-hackers-outgoing; Tue, 18 Jan 2000 20:09:31 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by hub.org (8.9.3/8.9.3) with ESMTP id UAA41943 - for ; Tue, 18 Jan 2000 20:08:39 -0500 (EST) - (envelope-from Inoue@tpf.co.jp) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id KAA02790; Wed, 19 Jan 2000 10:08:02 +0900 -From: "Hiroshi Inoue" -To: "Bruce Momjian" -Cc: "pgsql-hackers" -Subject: RE: [HACKERS] Index recreation in vacuum -Date: Wed, 19 Jan 2000 10:13:40 +0900 -Message-ID: <000201bf621a$6b9baf20$2801007e@tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -Importance: Normal -In-Reply-To: <200001181821.NAA02988@candle.pha.pa.us> -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -> -----Original Message----- -> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -> -> [Charset iso-8859-1 unsupported, filtering to ASCII...] -> > Hi all, -> > -> > I'm trying to implement REINDEX command. -> > -> > REINDEX operation itself is available everywhere and -> > I've thought about applying it to VACUUM. 
-> -> That is a good idea. Vacuuming of indexes can be very slow. -> -> > . -> > My plan is as follows. -> > -> > Add a new option to force index recreation in vacuum -> > and if index recreation is specified. -> -> Couldn't we auto-recreate indexes based on the number of tuples moved by -> vacuum, - -Yes,we could probably do it. But I'm not sure the availability of new -vacuum. - -New vacuum would give us a big advantage that -1) Much faster than current if vacuum remove/moves many tuples. -2) Does shrink index files - -But in case of abort/crash -1) couldn't choose index scan for the table -2) unique constraints of the table would be lost - -I don't know how people estimate this disadvantage. - -> -> > Now I'm inclined to use relhasindex of pg_class to -> > validate/invalidate indexes of a table at once. -> -> There are a few calls to CatalogIndexInsert() that know the -> system table they -> are using and know it has indexes, so it does not check that field. You -> could add cases for that. -> - -I think there aren't so many places to check. -I would examine it if my idea is OK. - -Regards. 
- -Hiroshi Inoue -Inoue@tpf.co.jp - -************ - -From owner-pgsql-hackers@hub.org Tue Jan 18 19:57:21 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA11764 - for ; Tue, 18 Jan 2000 20:57:19 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id UAA50653; - Tue, 18 Jan 2000 20:52:38 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 20:52:30 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id UAA50513 - for pgsql-hackers-outgoing; Tue, 18 Jan 2000 20:51:32 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id UAA50462 - for ; Tue, 18 Jan 2000 20:51:06 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id UAA11421; - Tue, 18 Jan 2000 20:50:50 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001190150.UAA11421@candle.pha.pa.us> -Subject: Re: [HACKERS] Index recreation in vacuum -In-Reply-To: <000201bf621a$6b9baf20$2801007e@tpf.co.jp> from Hiroshi Inoue at - "Jan 19, 2000 10:13:40 am" -To: Hiroshi Inoue -Date: Tue, 18 Jan 2000 20:50:50 -0500 (EST) -CC: pgsql-hackers -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: ROr - -> > > Add a new option to force index recreation in vacuum -> > > and if index recreation is specified. -> > -> > Couldn't we auto-recreate indexes based on the number of tuples moved by -> > vacuum, -> -> Yes,we could probably do it. But I'm not sure the availability of new -> vacuum. -> -> New vacuum would give us a big advantage that -> 1) Much faster than current if vacuum remove/moves many tuples. 
-> 2) Does shrink index files
->
-> But in case of abort/crash
-> 1) couldn't choose index scan for the table
-> 2) unique constraints of the table would be lost
->
-> I don't know how people estimate this disadvantage.
-
-That's why I was recommending rename().  The actual window of
-vulnerability goes from perhaps hours to fractions of a second.
-
-In fact, if I understand this right, you could make the vulnerability
-zero by just performing the rename as one operation.
-
-In fact, for REINDEX cases where you don't have a lock on the entire
-table as you do in vacuum, you could reindex the table with a simple
-read-lock on the base table and index, and move the new index into place
-with the users seeing no change.  Only people traversing the index
-during the change would have a problem.  You just need exclusive
-access on the index for the duration of the rename() so no one is
-traversing the index during the rename().
-
-Destroying the index and recreating it opens a large time span during
-which there is no index, and you have to jury-rig something so people
-don't try to use the index.  With rename() you just put the new index
-in place with one operation.  Just don't let people traverse the index
-during the change.  The pointers to the heap tuples are the same in
-both indexes.
-
-In fact, with WAL, we will allow multiple physical files for the same
-table by appending the table oid to the file name.  In this case, the
-old index could be deleted by rename, and people would continue to use
-the old index until they closed the open file pointers.  Not sure how
-this works in practice because new tuples would not be inserted into the
-old copy of the index.
-
-
--- 
-  Bruce Momjian                        |  http://www.op.net/~candle
-  pgman@candle.pha.pa.us               |  (610) 853-3000
-  +  If your life is a hard drive,     |  830 Blythe Avenue
-  +  Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - -************ - -From pgman Tue Jan 18 20:04:11 2000 -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id VAA11990; - Tue, 18 Jan 2000 21:04:11 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001190204.VAA11990@candle.pha.pa.us> -Subject: Re: [HACKERS] Index recreation in vacuum -In-Reply-To: <200001190150.UAA11421@candle.pha.pa.us> from Bruce Momjian at "Jan - 18, 2000 08:50:50 pm" -To: Bruce Momjian -Date: Tue, 18 Jan 2000 21:04:11 -0500 (EST) -CC: Hiroshi Inoue , - pgsql-hackers -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Status: RO - -> > I don't know how people estimate this disadvantage. -> -> That's why I was recommending rename(). The actual window of -> vunerability goes from perhaps hours to fractions of a second. -> -> In fact, if I understand this right, you could make the vulerability -> zero by just performing the rename as one operation. -> -> In fact, for REINDEX cases where you don't have a lock on the entire -> table as you do in vacuum, you could reindex the table with a simple -> read-lock on the base table and index, and move the new index into place -> with the users seeing no change. Only people traversing the index -> during the change would have a problem. You just need an exclusive -> access on the index for the duration of the rename() so no one is -> traversing the index during the rename(). -> -> Destroying the index and recreating opens a large time span that there -> is no index, and you have to jury-rig something so people don't try to -> use the index. With rename() you just put the new index in place with -> one operation. Just don't let people traverse the index during the -> change. The pointers to the heap tuples is the same in both indexes. -> -> In fact, with WAL, we will allow multiple physical files for the same -> table by appending the table oid to the file name. 
In this case, the -> old index could be deleted by rename, and people would continue to use -> the old index until they deleted the open file pointers. Not sure how -> this works in practice because new tuples would not be inserted into the -> old copy of the index. - -Maybe I am all wrong here. Maybe most of the advantage of rename() are -meaningless with reindex using during vacuum, which is the most -important use of reindex. - -Let's look at index using during vacuum. Right now, how does vacuum -handle indexes when it moves a tuple? Does it do each index update as -it moves a tuple? Is that why it is so slow? - -If we don't do that and vacuum fails, what state is the table left in? -If we don't update the index for every tuple, the index is invalid in a -vacuum failure. rename() is not going to help us here. It keeps the -old index around, but the index is invalid anyway, right? - - --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - -From Inoue@tpf.co.jp Tue Jan 18 20:18:48 2000 -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA12437 - for ; Tue, 18 Jan 2000 21:18:46 -0500 (EST) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id LAA02845; Wed, 19 Jan 2000 11:18:18 +0900 -From: "Hiroshi Inoue" -To: "Bruce Momjian" -Cc: "pgsql-hackers" -Subject: RE: [HACKERS] Index recreation in vacuum -Date: Wed, 19 Jan 2000 11:23:55 +0900 -Message-ID: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -Importance: Normal -In-Reply-To: <200001190204.VAA11990@candle.pha.pa.us> -Status: ROr - -> -----Original Message----- -> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -> -> > > I don't know how people estimate this disadvantage. -> > -> > That's why I was recommending rename(). The actual window of -> > vunerability goes from perhaps hours to fractions of a second. -> > -> > In fact, if I understand this right, you could make the vulerability -> > zero by just performing the rename as one operation. -> > -> > In fact, for REINDEX cases where you don't have a lock on the entire -> > table as you do in vacuum, you could reindex the table with a simple -> > read-lock on the base table and index, and move the new index into place -> > with the users seeing no change. Only people traversing the index -> > during the change would have a problem. You just need an exclusive -> > access on the index for the duration of the rename() so no one is -> > traversing the index during the rename(). 
-> > -> > Destroying the index and recreating opens a large time span that there -> > is no index, and you have to jury-rig something so people don't try to -> > use the index. With rename() you just put the new index in place with -> > one operation. Just don't let people traverse the index during the -> > change. The pointers to the heap tuples is the same in both indexes. -> > -> > In fact, with WAL, we will allow multiple physical files for the same -> > table by appending the table oid to the file name. In this case, the -> > old index could be deleted by rename, and people would continue to use -> > the old index until they deleted the open file pointers. Not sure how -> > this works in practice because new tuples would not be inserted into the -> > old copy of the index. -> -> Maybe I am all wrong here. Maybe most of the advantage of rename() are -> meaningless with reindex using during vacuum, which is the most -> important use of reindex. -> -> Let's look at index using during vacuum. Right now, how does vacuum -> handle indexes when it moves a tuple? Does it do each index update as -> it moves a tuple? Is that why it is so slow? -> - -Yes,I believe so. It's necessary to keep consistency between heap -table and indexes even in case of abort/crash. -As far as I see,it has been a big charge for vacuum. - -Regards. 
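A back-of-the-envelope operation count shows why per-tuple index maintenance dominates vacuum time. The model below is purely illustrative (the function names and cost formulas are invented for this sketch, not PostgreSQL internals): every moved heap tuple costs a delete plus an insert in each index, while a bulk rebuild pays roughly one insert per surviving entry.

```python
def vacuum_index_ops(moved_tuples, n_indexes):
    """Per-tuple maintenance: each moved tuple needs its old entry
    deleted and a new entry inserted in every index."""
    return moved_tuples * n_indexes * 2

def rebuild_index_ops(live_tuples, n_indexes):
    """Bulk rebuild: one insert per live tuple per index (the sort
    cost of index creation is ignored in this toy model)."""
    return live_tuples * n_indexes

# Moving most of a large table's tuples one by one touches the indexes
# far more often than rebuilding them over the survivors would.
```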
- -Hiroshi Inoue -Inoue@tpf.co.jp - - -From owner-pgsql-hackers@hub.org Tue Jan 18 20:53:49 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA13285 - for ; Tue, 18 Jan 2000 21:53:47 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id VAA65183; - Tue, 18 Jan 2000 21:47:47 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 21:47:33 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id VAA65091 - for pgsql-hackers-outgoing; Tue, 18 Jan 2000 21:46:33 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id VAA65034 - for ; Tue, 18 Jan 2000 21:46:12 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id VAA13040; - Tue, 18 Jan 2000 21:45:27 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001190245.VAA13040@candle.pha.pa.us> -Subject: Re: [HACKERS] Index recreation in vacuum -In-Reply-To: <000801bf6224$3bfdd9a0$2801007e@tpf.co.jp> from Hiroshi Inoue at - "Jan 19, 2000 11:23:55 am" -To: Hiroshi Inoue -Date: Tue, 18 Jan 2000 21:45:27 -0500 (EST) -CC: pgsql-hackers -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -> > > In fact, for REINDEX cases where you don't have a lock on the entire -> > > table as you do in vacuum, you could reindex the table with a simple -> > > read-lock on the base table and index, and move the new index into place -> > > with the users seeing no change. Only people traversing the index -> > > during the change would have a problem. 
You just need an exclusive -> > > access on the index for the duration of the rename() so no one is -> > > traversing the index during the rename(). -> > > -> > > Destroying the index and recreating opens a large time span that there -> > > is no index, and you have to jury-rig something so people don't try to -> > > use the index. With rename() you just put the new index in place with -> > > one operation. Just don't let people traverse the index during the -> > > change. The pointers to the heap tuples is the same in both indexes. -> > > -> > > In fact, with WAL, we will allow multiple physical files for the same -> > > table by appending the table oid to the file name. In this case, the -> > > old index could be deleted by rename, and people would continue to use -> > > the old index until they deleted the open file pointers. Not sure how -> > > this works in practice because new tuples would not be inserted into the -> > > old copy of the index. -> > -> > Maybe I am all wrong here. Maybe most of the advantage of rename() are -> > meaningless with reindex using during vacuum, which is the most -> > important use of reindex. -> > -> > Let's look at index using during vacuum. Right now, how does vacuum -> > handle indexes when it moves a tuple? Does it do each index update as -> > it moves a tuple? Is that why it is so slow? -> > -> -> Yes,I believe so. It's necessary to keep consistency between heap -> table and indexes even in case of abort/crash. -> As far as I see,it has been a big charge for vacuum. - -OK, how about making a copy of the heap table before starting vacuum, -moving all the tuples in that copy, create new index, and then move the -new heap and indexes over the old version. We already have an exclusive -lock on the table. That would be 100% reliable, with the disadvantage -of using 2x the disk space. Seems like a big win. 
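The copy-then-swap idea can be sketched outside the server. The toy compactor below is a minimal illustration only (the one-row-per-line "heap file" format, the `#` dead-tuple marker, and all names are invented, not PostgreSQL code): live rows are rewritten into a scratch file, which is then installed over the original with a single atomic rename, so readers ever see either the old file or the new one, never a half-built state.

```python
import os
import tempfile

def compact_table(path):
    """Rewrite a toy 'heap file' without its dead rows, then atomically
    swap the result into place, mimicking copy-then-rename vacuum.
    Rows are one per line; a leading '#' marks a dead (expired) tuple."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)  # scratch file, same filesystem
    try:
        with os.fdopen(fd, "w") as out, open(path) as src:
            for line in src:
                if not line.startswith("#"):   # keep only live tuples
                    out.write(line)
        os.replace(tmp_path, path)  # atomic swap: the window shrinks to this call
    except BaseException:
        os.unlink(tmp_path)         # on failure the original is untouched
        raise
```

The cost, as noted above, is transiently holding two copies on disk.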
- --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - -************ - -From owner-pgsql-hackers@hub.org Tue Jan 18 21:15:24 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA14115 - for ; Tue, 18 Jan 2000 22:15:23 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id WAA72950; - Tue, 18 Jan 2000 22:10:40 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Tue, 18 Jan 2000 22:10:32 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id WAA72644 - for pgsql-hackers-outgoing; Tue, 18 Jan 2000 22:09:36 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id WAA72504 - for ; Tue, 18 Jan 2000 22:08:40 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id WAA13965; - Tue, 18 Jan 2000 22:08:25 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001190308.WAA13965@candle.pha.pa.us> -Subject: Re: [HACKERS] Index recreation in vacuum -In-Reply-To: <000f01bf622a$bf423940$2801007e@tpf.co.jp> from Hiroshi Inoue at - "Jan 19, 2000 12:10:32 pm" -To: Hiroshi Inoue -Date: Tue, 18 Jan 2000 22:08:25 -0500 (EST) -CC: pgsql-hackers -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=UNKNOWN-8BIT -Content-Transfer-Encoding: 8bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -> I heard from someone that old vacuum had been like so. -> Probably 2x disk space for big tables was a big disadvantage. - -That's interesting. - -> -> In addition,rename(),unlink(),mv aren't preferable for transaction -> control as far as I see. 
We couldn't avoid inconsistency using -> those OS functions. - -I disagree. Vacuum can't be rolled back anyway in the sense you can -bring back expire tuples, though I have no idea why you would want to. - -You have an exclusive lock on the table. Putting new heap/indexes in -place that match and have no expired tuples seems like it can not fail -in any situation. - -Of course, the buffers of the old table have to be marked as invalid, -but with an exclusive lock, that is not a problem. I am sure we do that -anyway in vacuum. - -> We have to wait the change of relation file naming if copying -> vacuum is needed. -> Under the spec we need not rename(),mv etc. - -Sorry, I don't agree, yet... - --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - -************ - -From Inoue@tpf.co.jp Tue Jan 18 21:05:23 2000 -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id WAA13858 - for ; Tue, 18 Jan 2000 22:05:21 -0500 (EST) -Received: from cadzone ([126.0.1.40] (may be forged)) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id MAA02870; Wed, 19 Jan 2000 12:04:55 +0900 -From: "Hiroshi Inoue" -To: "Bruce Momjian" -Cc: "pgsql-hackers" -Subject: RE: [HACKERS] Index recreation in vacuum -Date: Wed, 19 Jan 2000 12:10:32 +0900 -Message-ID: <000f01bf622a$bf423940$2801007e@tpf.co.jp> -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300 -Importance: Normal -In-Reply-To: <200001190245.VAA13040@candle.pha.pa.us> -Status: ROr - -> -----Original Message----- -> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us] -> > > -> > > Maybe I am all wrong 
here. Maybe most of the advantage of -> rename() are -> > > meaningless with reindex using during vacuum, which is the most -> > > important use of reindex. -> > > -> > > Let's look at index using during vacuum. Right now, how does vacuum -> > > handle indexes when it moves a tuple? Does it do each index update as -> > > it moves a tuple? Is that why it is so slow? -> > > -> > -> > Yes,I believe so. It's necessary to keep consistency between heap -> > table and indexes even in case of abort/crash. -> > As far as I see,it has been a big charge for vacuum. -> -> OK, how about making a copy of the heap table before starting vacuum, -> moving all the tuples in that copy, create new index, and then move the -> new heap and indexes over the old version. We already have an exclusive -> lock on the table. That would be 100% reliable, with the disadvantage -> of using 2x the disk space. Seems like a big win. -> - -I heard from someone that old vacuum had been like so. -Probably 2x disk space for big tables was a big disadvantage. - -In addition,rename(),unlink(),mv aren't preferable for transaction -control as far as I see. We couldn't avoid inconsistency using -those OS functions. -We have to wait the change of relation file naming if copying -vacuum is needed. -Under the spec we need not rename(),mv etc. - -Regards. 
-
-Hiroshi Inoue
-Inoue@tpf.co.jp
-
-
-
-
-From dms@wplus.net Wed Jan 19 15:30:40 2000
-Received: from relay.wplus.net (relay.wplus.net [195.131.52.179])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA25919
-	for ; Wed, 19 Jan 2000 16:30:38 -0500 (EST)
-X-Real-To: pgman@candle.pha.pa.us
-Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71])
-	by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA64218;
-	Thu, 20 Jan 2000 00:26:37 +0300 (MSK)
-Message-ID: <38862C9D.C2151E4E@wplus.net>
-Date: Thu, 20 Jan 2000 00:29:01 +0300
-From: Dmitry Samersoff
-X-Mailer: Mozilla 4.61 [en] (WinNT; I)
-X-Accept-Language: ru,en
-MIME-Version: 1.0
-To: Hiroshi Inoue
-CC: Bruce Momjian ,
-        pgsql-hackers
-Subject: Re: [HACKERS] Index recreation in vacuum
-References: <000f01bf622a$bf423940$2801007e@tpf.co.jp>
-Content-Type: text/plain; charset=koi8-r
-Content-Transfer-Encoding: 7bit
-Status: ROr
-
-Hiroshi Inoue wrote:
-> > > Yes,I believe so. It's necessary to keep consistency between heap
-> > > table and indexes even in case of abort/crash.
-> > > As far as I see,it has been a big charge for vacuum.
-> >
-> > OK, how about making a copy of the heap table before starting vacuum,
-> > moving all the tuples in that copy, create new index, and then move the
-> > new heap and indexes over the old version.  We already have an exclusive
-> > lock on the table.  That would be 100% reliable, with the disadvantage
-> > of using 2x the disk space.  Seems like a big win.
-> >
->
-> I heard from someone that old vacuum had been like so.
-> Probably 2x disk space for big tables was a big disadvantage.
-
-Yes, it is critical.
-
-How about a sequence like this:
-
-* Drop the indices (keeping the index descriptions somewhere)
-* Vacuum the table
-* Recreate the indices
-
-If something crashes, the user is notified to re-run vacuum or
-recreate the indices by hand when the system restarts.
-
-I use a script like the one described above for vacuuming -
-it really increases vacuum performance for large tables. 
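The drop/vacuum/recreate sequence described above amounts to emitting three SQL statements per index-bearing table. A small generator in that spirit (the table name, index names, and column lists here are examples only; a real script would save the index definitions from the catalog before dropping them):

```python
def fast_vacuum_script(table, indexes):
    """Emit the drop / vacuum / recreate sequence as a SQL script.
    `indexes` is a list of (index_name, column_list) pairs the caller
    has saved beforehand, since DROP INDEX discards the definition."""
    stmts = [f"DROP INDEX {name};" for name, _ in indexes]
    stmts.append(f"VACUUM {table};")  # runs at heap-only speed, no indexes to maintain
    stmts += [f"CREATE INDEX {name} ON {table} ({cols});"
              for name, cols in indexes]
    return "\n".join(stmts)
```

As noted in the reply below, the recreate step can itself take hours on a big table, so the sequence trades vacuum time for a long window without indexes.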
-
-
---
-Dmitry Samersoff, DM\S
-dms@wplus.net  http://devnull.wplus.net
-* there will come soft rains
-
-From dms@wplus.net Wed Jan 19 15:42:49 2000
-Received: from relay.wplus.net (relay.wplus.net [195.131.52.179])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id QAA26645
-	for ; Wed, 19 Jan 2000 16:42:47 -0500 (EST)
-X-Real-To: pgman@candle.pha.pa.us
-Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71])
-	by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id AAA65264;
-	Thu, 20 Jan 2000 00:39:02 +0300 (MSK)
-Message-ID: <38862F86.20328BD3@wplus.net>
-Date: Thu, 20 Jan 2000 00:41:26 +0300
-From: Dmitry Samersoff
-X-Mailer: Mozilla 4.7 [en] (WinNT; I)
-X-Accept-Language: ru,en
-MIME-Version: 1.0
-To: Bruce Momjian
-CC: Hiroshi Inoue ,
-        pgsql-hackers
-Subject: Re: [HACKERS] Index recreation in vacuum
-References: <200001192132.QAA26048@candle.pha.pa.us>
-Content-Type: text/plain; charset=koi8-r
-Content-Transfer-Encoding: 7bit
-Status: ROr
-
-Bruce Momjian wrote:
->
-> We need two things:
->
-> 	auto-create index on startup
-
-IMHO, it has to be controlled by the user, because creating a large
-index can take many hours.  Sometimes it's better to live without
-indices at all, and then build them by hand after the workday ends. 
- - --- -Dmitry Samersoff, DM\S -dms@wplus.net http://devnull.wplus.net -* there will come soft rains - -From owner-pgsql-hackers@hub.org Thu Jan 20 23:51:34 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13891 - for ; Fri, 21 Jan 2000 00:51:31 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id AAA91784; - Fri, 21 Jan 2000 00:47:07 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 00:45:38 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id AAA91495 - for pgsql-hackers-outgoing; Fri, 21 Jan 2000 00:44:40 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id AAA91378 - for ; Fri, 21 Jan 2000 00:44:04 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id AAA13592; - Fri, 21 Jan 2000 00:43:49 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001210543.AAA13592@candle.pha.pa.us> -Subject: [HACKERS] vacuum timings -To: Tom Lane -Date: Fri, 21 Jan 2000 00:43:49 -0500 (EST) -CC: PostgreSQL-development -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is -400MB and index is 160MB. - -With index on the single in4 column, I got: - 78 seconds for a vacuum - 121 seconds for vacuum after deleting a single row - 662 seconds for vacuum after deleting the entire table - -With no index, I got: - 43 seconds for a vacuum - 43 seconds for vacuum after deleting a single row - 43 seconds for vacuum after deleting the entire table - -I find this quite interesting. 
- --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - -************ - -From owner-pgsql-hackers@hub.org Fri Jan 21 00:34:56 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15559 - for ; Fri, 21 Jan 2000 01:34:55 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id BAA06108; - Fri, 21 Jan 2000 01:32:23 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 01:30:38 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id BAA03704 - for pgsql-hackers-outgoing; Fri, 21 Jan 2000 01:27:53 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sunpine.krs.ru (SunPine.krs.ru [195.161.16.37]) - by hub.org (8.9.3/8.9.3) with ESMTP id BAA01710 - for ; Fri, 21 Jan 2000 01:26:44 -0500 (EST) - (envelope-from vadim@krs.ru) -Received: from krs.ru (dune.krs.ru [195.161.16.38]) - by sunpine.krs.ru (8.8.8/8.8.8) with ESMTP id NAA01685; - Fri, 21 Jan 2000 13:26:33 +0700 (KRS) -Message-ID: <3887FC19.80305217@krs.ru> -Date: Fri, 21 Jan 2000 13:26:33 +0700 -From: Vadim Mikheev -Organization: OJSC Rostelecom (Krasnoyarsk) -X-Mailer: Mozilla 4.5 [en] (X11; I; FreeBSD 3.0-RELEASE i386) -X-Accept-Language: ru, en -MIME-Version: 1.0 -To: Bruce Momjian -CC: Tom Lane , - PostgreSQL-development -Subject: Re: [HACKERS] vacuum timings -References: <200001210543.AAA13592@candle.pha.pa.us> -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -Bruce Momjian wrote: -> -> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is -> 400MB and index is 160MB. 
->
-> With index on the single in4 column, I got:
-> 	78 seconds for a vacuum
-> 	121 seconds for vacuum after deleting a single row
-> 	662 seconds for vacuum after deleting the entire table
->
-> With no index, I got:
-> 	43 seconds for a vacuum
-> 	43 seconds for vacuum after deleting a single row
-> 	43 seconds for vacuum after deleting the entire table
-
-Wi/wo -F ?
-
-Vadim
-
-************
-
-From Inoue@tpf.co.jp Fri Jan 21 00:40:35 2000
-Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
-	by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA15684
-	for ; Fri, 21 Jan 2000 01:40:33 -0500 (EST)
-Received: from cadzone ([126.0.1.40] (may be forged))
-	by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP
-	id PAA04316; Fri, 21 Jan 2000 15:40:35 +0900
-From: "Hiroshi Inoue"
-To: "Bruce Momjian"
-Cc: "PostgreSQL-development" ,
-	"Tom Lane"
-Subject: RE: [HACKERS] vacuum timings
-Date: Fri, 21 Jan 2000 15:46:15 +0900
-Message-ID: <000201bf63db$36cdae20$2801007e@tpf.co.jp>
-MIME-Version: 1.0
-Content-Type: text/plain;
-	charset="iso-8859-1"
-Content-Transfer-Encoding: 7bit
-X-Priority: 3 (Normal)
-X-MSMail-Priority: Normal
-X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0
-Importance: Normal
-X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
-In-Reply-To: <200001210543.AAA13592@candle.pha.pa.us>
-Status: RO
-
-> -----Original Message-----
-> From: owner-pgsql-hackers@postgreSQL.org
-> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Bruce Momjian
->
-> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER);  Table is
-> 400MB and index is 160MB.
->
-> With index on the single in4 column, I got:
-> 	78 seconds for a vacuum
-	vc_vaconeind() is called once
-
-> 	121 seconds for vacuum after deleting a single row
-	vc_vaconeind() is called twice
-
-Hmmm,vc_vaconeind() takes pretty long time even if it does little.
-
-> 	662 seconds for vacuum after deleting the entire table
->
-
-How about half of the rows deleted case ?
-It would take longer time.
-
-Regards. 
- -Hiroshi Inoue -Inoue@tpf.co.jp - -From owner-pgsql-hackers@hub.org Fri Jan 21 12:00:49 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA13329 - for ; Fri, 21 Jan 2000 13:00:47 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id MAA96106; - Fri, 21 Jan 2000 12:55:34 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 12:53:53 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id MAA95775 - for pgsql-hackers-outgoing; Fri, 21 Jan 2000 12:52:54 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (root@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id MAA95720 - for ; Fri, 21 Jan 2000 12:52:39 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id MAA12106; - Fri, 21 Jan 2000 12:51:53 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001211751.MAA12106@candle.pha.pa.us> -Subject: [HACKERS] Re: vacuum timings -In-Reply-To: <3641.948433911@sss.pgh.pa.us> from Tom Lane at "Jan 21, 2000 00:51:51 - am" -To: Tom Lane -Date: Fri, 21 Jan 2000 12:51:53 -0500 (EST) -CC: PostgreSQL-development -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -> Bruce Momjian writes: -> > I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is -> > 400MB and index is 160MB. 
-> -> > With index on the single in4 column, I got: -> > 78 seconds for a vacuum -> > 121 seconds for vacuum after deleting a single row -> > 662 seconds for vacuum after deleting the entire table -> -> > With no index, I got: -> > 43 seconds for a vacuum -> > 43 seconds for vacuum after deleting a single row -> > 43 seconds for vacuum after deleting the entire table -> -> > I find this quite interesting. -> -> How long does it take to create the index on your setup --- ie, -> if vacuum did a drop/create index, would it be competitive? - -OK, new timings with -F enabled: - - index no index - 519 same load - 247 " first vacuum - 40 " other vacuums - - 1222 X index creation - 90 X first vacuum - 80 X other vacuums - - <1 90 delete one row - 121 38 vacuum after delete 1 row - - 346 344 delete all rows - 440 44 first vacuum - 20 <1 other vacuums(index is still same size) - -Conclusions: - - o indexes never get smaller - o drop/recreate index is slower than vacuum of indexes - -What other conclusions can be made? - --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. 
| Drexel Hill, Pennsylvania 19026 - -************ - -From scrappy@hub.org Fri Jan 21 12:45:38 2000 -Received: from thelab.hub.org (nat200.60.mpoweredpc.net [142.177.200.60]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id NAA14380 - for ; Fri, 21 Jan 2000 13:45:29 -0500 (EST) -Received: from localhost (scrappy@localhost) - by thelab.hub.org (8.9.3/8.9.1) with ESMTP id OAA68289; - Fri, 21 Jan 2000 14:45:35 -0400 (AST) - (envelope-from scrappy@hub.org) -X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs -Date: Fri, 21 Jan 2000 14:45:34 -0400 (AST) -From: The Hermit Hacker -To: Bruce Momjian -cc: Tom Lane , - PostgreSQL-development -Subject: Re: [HACKERS] Re: vacuum timings -In-Reply-To: <200001211751.MAA12106@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO - -On Fri, 21 Jan 2000, Bruce Momjian wrote: - -> OK, new timings with -F enabled: -> -> index no index -> 519 same load -> 247 " first vacuum -> 40 " other vacuums -> -> 1222 X index creation -> 90 X first vacuum -> 80 X other vacuums -> -> <1 90 delete one row -> 121 38 vacuum after delete 1 row -> -> 346 344 delete all rows -> 440 44 first vacuum -> 20 <1 other vacuums(index is still same size) -> -> Conclusions: -> -> o indexes never get smaller - -this one, I thought, was a known? if I remember right, Vadim changed it -so that space was reused, but index never shrunk in size ... no? - -Marc G. 
Fournier ICQ#7615664 IRC Nick: Scrappy -Systems Administrator @ hub.org -primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org - - -From tgl@sss.pgh.pa.us Fri Jan 21 13:06:35 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA14618 - for ; Fri, 21 Jan 2000 14:06:33 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id OAA16501; - Fri, 21 Jan 2000 14:06:31 -0500 (EST) -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: vacuum timings -In-reply-to: <200001211751.MAA12106@candle.pha.pa.us> -References: <200001211751.MAA12106@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Fri, 21 Jan 2000 12:51:53 -0500" -Date: Fri, 21 Jan 2000 14:06:31 -0500 -Message-ID: <16498.948481591@sss.pgh.pa.us> -From: Tom Lane -Status: RO - -Bruce Momjian writes: -> Conclusions: -> o indexes never get smaller - -Which we knew... - -> o drop/recreate index is slower than vacuum of indexes - -Quite a few people have reported finding the opposite in practice. -You should probably try vacuuming after deleting or updating some -fraction of the rows, rather than just the all or none cases. 
- - regards, tom lane - -From dms@wplus.net Fri Jan 21 13:51:27 2000 -Received: from relay.wplus.net (relay.wplus.net [195.131.52.179]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id OAA15623 - for ; Fri, 21 Jan 2000 14:51:24 -0500 (EST) -X-Real-To: pgman@candle.pha.pa.us -Received: from wplus.net (ppdms.dialup.wplus.net [195.131.52.71]) - by relay.wplus.net (8.9.1/8.9.1/wplus.2) with ESMTP id WAA89451; - Fri, 21 Jan 2000 22:46:19 +0300 (MSK) -Message-ID: <3888B822.28F79A1F@wplus.net> -Date: Fri, 21 Jan 2000 22:48:50 +0300 -From: Dmitry Samersoff -X-Mailer: Mozilla 4.7 [en] (WinNT; I) -X-Accept-Language: ru,en -MIME-Version: 1.0 -To: Tom Lane -CC: Bruce Momjian , - PostgreSQL-development -Subject: Re: [HACKERS] Re: vacuum timings -References: <200001211751.MAA12106@candle.pha.pa.us> <16498.948481591@sss.pgh.pa.us> -Content-Type: text/plain; charset=koi8-r -Content-Transfer-Encoding: 7bit -Status: ROr - -Tom Lane wrote: -> -> Bruce Momjian writes: -> > Conclusions: -> > o indexes never get smaller -> -> Which we knew... -> -> > o drop/recreate index is slower than vacuum of indexes -> -> Quite a few people have reported finding the opposite in practice. - -I'm one of them. On a 1,5 GB table with three indices it is about twice -as slow. -Probably because vacuuming indices breaks system cache policy.
-(FreeBSD 3.3) - - - --- -Dmitry Samersoff, DM\S -dms@wplus.net http://devnull.wplus.net -* there will come soft rains - -From owner-pgsql-hackers@hub.org Fri Jan 21 14:04:08 2000 -Received: from hub.org (hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16140 - for ; Fri, 21 Jan 2000 15:04:06 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id OAA34808; - Fri, 21 Jan 2000 14:59:30 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Fri, 21 Jan 2000 14:57:48 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id OAA34320 - for pgsql-hackers-outgoing; Fri, 21 Jan 2000 14:56:50 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from candle.pha.pa.us (pgman@s5-03.ppp.op.net [209.152.195.67]) - by hub.org (8.9.3/8.9.3) with ESMTP id OAA34255 - for ; Fri, 21 Jan 2000 14:56:18 -0500 (EST) - (envelope-from pgman@candle.pha.pa.us) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.9.0/8.9.0) id OAA15772; - Fri, 21 Jan 2000 14:54:22 -0500 (EST) -From: Bruce Momjian -Message-Id: <200001211954.OAA15772@candle.pha.pa.us> -Subject: Re: [HACKERS] Re: vacuum timings -In-Reply-To: <3888B822.28F79A1F@wplus.net> from Dmitry Samersoff at "Jan 21, - 2000 10:48:50 pm" -To: Dmitry Samersoff -Date: Fri, 21 Jan 2000 14:54:21 -0500 (EST) -CC: Tom Lane , - PostgreSQL-development -X-Mailer: ELM [version 2.4ME+ PL66 (25)] -MIME-Version: 1.0 -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7bit -Sender: owner-pgsql-hackers@postgreSQL.org -Status: RO - -[Charset koi8-r unsupported, filtering to ASCII...] -> Tom Lane wrote: -> > -> > Bruce Momjian writes: -> > > Conclusions: -> > > o indexes never get smaller -> > -> > Which we knew... -> > -> > > o drop/recreate index is slower than vacuum of indexes -> > -> > Quite a few people have reported finding the opposite in practice. -> -> I'm one of them. 
On 1,5 GB table with three indices it about twice -> slowly. -> Probably becouse vacuuming indices brakes system cache policy. -> (FreeBSD 3.3) - -OK, we are researching what things can be done to improve this. We are -toying with: - - lock table for less duration, or read lock - creating another copy of heap/indexes, and rename() over old files - improving heap vacuum speed - improving index vacuum speed - moving analyze out of vacuum - - --- - Bruce Momjian | http://www.op.net/~candle - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - -************ - -From scrappy@hub.org Fri Jan 21 14:12:16 2000 -Received: from thelab.hub.org (nat200.60.mpoweredpc.net [142.177.200.60]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id PAA16521 - for ; Fri, 21 Jan 2000 15:12:13 -0500 (EST) -Received: from localhost (scrappy@localhost) - by thelab.hub.org (8.9.3/8.9.1) with ESMTP id QAA69039; - Fri, 21 Jan 2000 16:12:25 -0400 (AST) - (envelope-from scrappy@hub.org) -X-Authentication-Warning: thelab.hub.org: scrappy owned process doing -bs -Date: Fri, 21 Jan 2000 16:12:25 -0400 (AST) -From: The Hermit Hacker -To: Bruce Momjian -cc: Dmitry Samersoff , Tom Lane , - PostgreSQL-development -Subject: Re: [HACKERS] Re: vacuum timings -In-Reply-To: <200001211954.OAA15772@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO - -On Fri, 21 Jan 2000, Bruce Momjian wrote: - -> [Charset koi8-r unsupported, filtering to ASCII...] -> > Tom Lane wrote: -> > > -> > > Bruce Momjian writes: -> > > > Conclusions: -> > > > o indexes never get smaller -> > > -> > > Which we knew... -> > > -> > > > o drop/recreate index is slower than vacuum of indexes -> > > -> > > Quite a few people have reported finding the opposite in practice. -> > -> > I'm one of them. On 1,5 GB table with three indices it about twice -> > slowly. 
-> > Probably becouse vacuuming indices brakes system cache policy. -> > (FreeBSD 3.3) -> -> OK, we are researching what things can be done to improve this. We are -> toying with: -> -> lock table for less duration, or read lock - -if there is some way that we can work around the bug that I believe Tom -found with removing the lock altogether (i.e. making use of MVCC), I think -that would be the best option ... if not possible, at least get things -down to a table lock vs the whole database? - -a good example is the udmsearch that we are using on the site ... it uses -multiple tables to store the dictionary, each representing words of X size -... if I'm searching on a 4 letter word, and the whole database is locked -while it is working on the dictionary with 8 letter words, I'm sitting -there idle ... at least if we only locked the 8 letter table, everyone not -doing 8 letter searches can go on their merry way ... - -Slightly longer vacuums, IMHO, are acceptable if, to the end users, it's -as transparent as possible ... locking per table would be slightly slower, -I think, because once a table is finished, the next table would need to -have an exclusive lock put on it before starting, so you'd have to -possibly wait for that...? - -> creating another copy of heap/indexes, and rename() over old files - -sounds to me like introducing a large potential for error here ... - -> moving analyze out of vacuum - -I think that should be done anyway ... if we ever get to the point that -we're able to re-use rows in tables, then that would eliminate the -immediate requirement for vacuum, but still retain a requirement for a -periodic analyze ... no? - -Marc G.
Fournier ICQ#7615664 IRC Nick: Scrappy -Systems Administrator @ hub.org -primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org - - -From tgl@sss.pgh.pa.us Fri Jan 21 16:02:07 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id RAA20290 - for ; Fri, 21 Jan 2000 17:02:06 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id RAA09697; - Fri, 21 Jan 2000 17:02:06 -0500 (EST) -To: The Hermit Hacker -cc: Bruce Momjian , - PostgreSQL-development -Subject: Re: [HACKERS] Re: vacuum timings -In-reply-to: -References: -Comments: In-reply-to The Hermit Hacker - message dated "Fri, 21 Jan 2000 16:12:25 -0400" -Date: Fri, 21 Jan 2000 17:02:06 -0500 -Message-ID: <9694.948492126@sss.pgh.pa.us> -From: Tom Lane -Status: RO - -The Hermit Hacker writes: ->> lock table for less duration, or read lock - -> if there is some way that we can work around the bug that I believe Tom -> found with removing the lock altogether (ie. makig use of MVCC), I think -> that would be the best option ... if not possible, at least get things -> down to a table lock vs the whole database? - -Huh? VACUUM only requires an exclusive lock on the table it is -currently vacuuming; there's no database-wide lock. - -Even a single-table exclusive lock is bad, of course, if it's a large -table that's critical to a 24x7 application. Bruce was talking about -the possibility of having VACUUM get just a write lock on the table; -other backends could still read it, but not write it, during the vacuum -process. That'd be a considerable step forward for 24x7 applications, -I think. - -It looks like that could be done if we rewrote the table as a new file -(instead of compacting-in-place), but there's a problem when it comes -time to rename the new files into place. 
At that point you'd need to -get an exclusive lock to ensure all the readers are out of the table too ---- and upgrading from a plain lock to an exclusive lock is a well-known -recipe for deadlocks. Not sure if this can be solved. - - regards, tom lane - -From tgl@sss.pgh.pa.us Fri Jan 21 22:50:34 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id XAA01657 - for ; Fri, 21 Jan 2000 23:50:28 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id XAA19681; - Fri, 21 Jan 2000 23:50:13 -0500 (EST) -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: vacuum timings -In-reply-to: <200001211751.MAA12106@candle.pha.pa.us> -References: <200001211751.MAA12106@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Fri, 21 Jan 2000 12:51:53 -0500" -Date: Fri, 21 Jan 2000 23:50:13 -0500 -Message-ID: <19678.948516613@sss.pgh.pa.us> -From: Tom Lane -Status: ROr - -Bruce Momjian writes: -> Conclusions: -> o drop/recreate index is slower than vacuum of indexes - -BTW, I did some profiling of CREATE INDEX this evening (quite -unintentionally actually; I was interested in COPY IN, but the pg_dump -script I used as driver happened to create some indexes too). I was -startled to discover that 60% of the runtime of CREATE INDEX is spent in -_bt_invokestrat (which is called from tuplesort.c's comparetup_index, -and exists only to figure out which specific comparison routine to call). -Of this, a whopping 4% was spent in the useful subroutine, int4gt. All -the rest went into lookup and validation checks that by rights should be -done once per index creation, not once per comparison. - -In short: a fairly straightforward bit of optimization will eliminate -circa 50% of the CPU time consumed by CREATE INDEX. All we need is to -figure out where to cache the lookup results. 
The optimization would -improve insertions and lookups in indexes, as well, if we can cache -the lookup results in those scenarios. - -This was for a table small enough that tuplesort.c could do the sort -entirely in memory, so I'm sure the gains would be smaller for a large -table that requires a disk-based sort. Still, it seems worth looking -into... - - regards, tom lane - -From owner-pgsql-hackers@hub.org Sat Jan 22 02:31:03 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA06743 - for ; Sat, 22 Jan 2000 03:31:02 -0500 (EST) -Received: from hub.org (hub.org [216.126.84.1]) by renoir.op.net (o1/$Revision: 1.7 $) with ESMTP id DAA07529 for ; Sat, 22 Jan 2000 03:25:13 -0500 (EST) -Received: from localhost (majordom@localhost) - by hub.org (8.9.3/8.9.3) with SMTP id DAA31900; - Sat, 22 Jan 2000 03:19:53 -0500 (EST) - (envelope-from owner-pgsql-hackers) -Received: by hub.org (bulk_mailer v1.5); Sat, 22 Jan 2000 03:17:56 -0500 -Received: (from majordom@localhost) - by hub.org (8.9.3/8.9.3) id DAA31715 - for pgsql-hackers-outgoing; Sat, 22 Jan 2000 03:16:58 -0500 (EST) - (envelope-from owner-pgsql-hackers@postgreSQL.org) -Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34]) - by hub.org (8.9.3/8.9.3) with ESMTP id DAA31647 - for ; Sat, 22 Jan 2000 03:16:26 -0500 (EST) - (envelope-from Inoue@tpf.co.jp) -Received: from mcadnote1 (ppm114.noc.fukui.nsk.ne.jp [210.161.188.33]) - by sd.tpf.co.jp (2.5 Build 2640 (Berkeley 8.8.6)/8.8.4) with SMTP - id RAA04754; Sat, 22 Jan 2000 17:14:43 +0900 -From: "Hiroshi Inoue" -To: "Tom Lane" , "Bruce Momjian" -Cc: "PostgreSQL-development" -Subject: RE: [HACKERS] Re: vacuum timings -Date: Sat, 22 Jan 2000 17:15:37 +0900 -Message-ID: -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-2022-jp" -Content-Transfer-Encoding: 7bit -X-Priority: 3 (Normal) -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0) -X-MimeOLE: 
Produced By Microsoft MimeOLE V5.00.2314.1300 -In-Reply-To: <16498.948481591@sss.pgh.pa.us> -Importance: Normal -Sender: owner-pgsql-hackers@postgresql.org -Status: RO - -> -----Original Message----- -> From: owner-pgsql-hackers@postgresql.org -> [mailto:owner-pgsql-hackers@postgresql.org]On Behalf Of Tom Lane -> -> Bruce Momjian writes: -> > Conclusions: -> > o indexes never get smaller -> -> Which we knew... -> -> > o drop/recreate index is slower than vacuum of indexes -> -> Quite a few people have reported finding the opposite in practice. -> You should probably try vacuuming after deleting or updating some -> fraction of the rows, rather than just the all or none cases. -> - -Vacuum after deleting all rows isn't a worst case. -There's no moving in that case and vacuum doesn't need to call -index_insert() corresponding to the moving of heap tuples. - -Vacuum after deleting half of rows may be one of the worst cases. -In this case, index_delete() is called as many times as in the 'delete all' -case and expensive index_insert() is called for moved_in tuples. - -Regards.
- -Hiroshi Inoue -Inoue@tpf.co.jp - -************ - -From tgl@sss.pgh.pa.us Sat Jan 22 10:31:02 2000 -Received: from renoir.op.net (root@renoir.op.net [207.29.195.4]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id LAA20882 - for ; Sat, 22 Jan 2000 11:31:00 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) by renoir.op.net (o1/$Revision: 1.7 $) with ESMTP id LAA26612 for ; Sat, 22 Jan 2000 11:12:44 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id LAA20569; - Sat, 22 Jan 2000 11:11:26 -0500 (EST) -To: "Hiroshi Inoue" -cc: "Bruce Momjian" , - "PostgreSQL-development" -Subject: Re: [HACKERS] Re: vacuum timings -In-reply-to: -References: -Comments: In-reply-to "Hiroshi Inoue" - message dated "Sat, 22 Jan 2000 17:15:37 +0900" -Date: Sat, 22 Jan 2000 11:11:25 -0500 -Message-ID: <20566.948557485@sss.pgh.pa.us> -From: Tom Lane -Status: RO - -"Hiroshi Inoue" writes: -> Vacuum after deleting half of rows may be one of the worst case. - -Or equivalently, vacuum after updating all the rows. 
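Hiroshi's worst-case argument can be put into a toy cost model. The function below is an illustration only, not PostgreSQL code; the operation counts follow his description: vacuum removes each dead index entry with index_delete(), and every live heap tuple moved into freed space costs one more index_delete() for the old entry plus one expensive index_insert() for the new one.

```python
def vacuum_index_ops(n_rows, n_dead):
    """Toy model of old-style VACUUM index work on a table of n_rows
    tuples, n_dead of which are dead.  Dead entries are removed with
    index_delete(); each live tuple moved into freed space needs its
    old index entry deleted and a new one inserted.  Simplification:
    assume one live tuple moves per reclaimed dead slot."""
    moved = min(n_dead, n_rows - n_dead)   # live tuples pulled into holes
    index_deletes = n_dead + moved         # dead entries + old entries of moved tuples
    index_inserts = moved                  # new entries for moved tuples
    return index_deletes, index_inserts

# Delete-all: maximal deletes, but no moves, hence no index_insert().
print(vacuum_index_ops(1_000_000, 1_000_000))  # (1000000, 0)

# Delete-half: the same number of index_delete() calls as delete-all,
# plus half a million of the expensive index_insert() calls.
print(vacuum_index_ops(1_000_000, 500_000))    # (1000000, 500000)
```

Under this model, deleting half the rows (or updating all of them) does strictly more index work than deleting everything, matching the observation above.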
- - regards, tom lane - -From tgl@sss.pgh.pa.us Thu Jan 20 23:51:49 2000 -Received: from sss2.sss.pgh.pa.us (sss.pgh.pa.us [209.114.166.2]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id AAA13919 - for ; Fri, 21 Jan 2000 00:51:47 -0500 (EST) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss2.sss.pgh.pa.us (8.9.3/8.9.3) with ESMTP id AAA03644; - Fri, 21 Jan 2000 00:51:51 -0500 (EST) -To: Bruce Momjian -cc: PostgreSQL-development -Subject: Re: vacuum timings -In-reply-to: <200001210543.AAA13592@candle.pha.pa.us> -References: <200001210543.AAA13592@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Fri, 21 Jan 2000 00:43:49 -0500" -Date: Fri, 21 Jan 2000 00:51:51 -0500 -Message-ID: <3641.948433911@sss.pgh.pa.us> -From: Tom Lane -Status: ROr - -Bruce Momjian writes: -> I loaded 10,000,000 rows into CREATE TABLE test (x INTEGER); Table is -> 400MB and index is 160MB. - -> With index on the single in4 column, I got: -> 78 seconds for a vacuum -> 121 seconds for vacuum after deleting a single row -> 662 seconds for vacuum after deleting the entire table - -> With no index, I got: -> 43 seconds for a vacuum -> 43 seconds for vacuum after deleting a single row -> 43 seconds for vacuum after deleting the entire table - -> I find this quite interesting. - -How long does it take to create the index on your setup --- ie, -if vacuum did a drop/create index, would it be competitive? 
- - regards, tom lane - -From pgsql-hackers-owner+M5909@hub.org Thu Aug 17 20:15:33 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id UAA00644 - for ; Thu, 17 Aug 2000 20:15:32 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7I0APm69660; - Thu, 17 Aug 2000 20:10:25 -0400 (EDT) -Received: from fw.wintelcom.net (bright@ns1.wintelcom.net [209.1.153.20]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7I01Jm68072 - for ; Thu, 17 Aug 2000 20:01:19 -0400 (EDT) -Received: (from bright@localhost) - by fw.wintelcom.net (8.10.0/8.10.0) id e7I01IA20820 - for pgsql-hackers@postgresql.org; Thu, 17 Aug 2000 17:01:18 -0700 (PDT) -Date: Thu, 17 Aug 2000 17:01:18 -0700 -From: Alfred Perlstein -To: pgsql-hackers@postgresql.org -Subject: [HACKERS] VACUUM optimization ideas. -Message-ID: <20000817170118.K4854@fw.wintelcom.net> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.4i -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ROr - -Here are two ideas I had for optimizing vacuum; I apologize in advance -if the ideas presented here are naive and don't take into account -the actual code that makes up postgresql. - -================ - -#1 - -Reducing the time vacuum must hold an exclusive lock on a table: - -The idea is that since rows are marked deleted it's ok for the -vacuum to fill them with data from the tail of the table as -long as no transaction is in progress that has started before -the row was deleted. - -This may allow the vacuum process to copy back all the data without -a lock; when all the copying is done it then acquires an exclusive lock -and does this: - -Acquire an exclusive lock. -Walk all the deleted data marking it as current. -Truncate the table. -Release the lock. - -Since the data is still marked invalid (right?)
even if valid data -is copied into the space, it should be ignored as long as there's no -transaction occurring that started before the data was invalidated. - -================ - -#2 - -Reducing the amount of scanning a vacuum must do: - -It would make sense that if a value of the earliest deleted chunk -was kept in a table then vacuum would not have to scan the entire -table in order to work; it would only need to start at the 'earliest' -invalidated row. - -The utility of this (at least for us) is that we have several tables -that will grow to hundreds of megabytes; however, changes will only -happen at the tail end (recently added rows). If we could reduce the -amount of time spent in a vacuum state it would help us a lot. - -================ - -I'm wondering if these ideas make sense and may help at all. - -thanks, --- --Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] - -From pgsql-hackers-owner+M5912@hub.org Fri Aug 18 01:36:14 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id BAA07787 - for ; Fri, 18 Aug 2000 01:36:12 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7I5Q2m38759; - Fri, 18 Aug 2000 01:26:04 -0400 (EDT) -Received: from courier02.adinet.com.uy (courier02.adinet.com.uy [206.99.44.245]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7I5Bam35785 - for ; Fri, 18 Aug 2000 01:11:37 -0400 (EDT) -Received: from adinet.com.uy (haroldo@r207-50-240-116.adinet.com.uy [207.50.240.116]) - by courier02.adinet.com.uy (8.9.3/8.9.3) with ESMTP id CAA17259; - Fri, 18 Aug 2000 02:10:49 -0300 (GMT) -Message-ID: <399CC739.B9B13D18@adinet.com.uy> -Date: Fri, 18 Aug 2000 02:18:49 -0300 -From: hstenger@adinet.com.uy -Reply-To: hstenger@ieee.org -Organization: PRISMA, Servicio y Desarrollo -X-Mailer: Mozilla 4.72 [en] (X11; I; Linux 2.2.14 i586) -X-Accept-Language: en -MIME-Version: 1.0 -To: Alfred Perlstein , pgsql-hackers@postgresql.org
-Subject: Re: [HACKERS] VACUUM optimization ideas. -References: <20000817170118.K4854@fw.wintelcom.net> -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: ROr - -Alfred Perlstein wrote: -> #1 -> -> Reducing the time vacuum must hold an exlusive lock on a table: -> -> The idea is that since rows are marked deleted it's ok for the -> vacuum to fill them with data from the tail of the table as -> long as no transaction is in progress that has started before -> the row was deleted. -> -> This may allow the vacuum process to copyback all the data without -> a lock, when all the copying is done it then aquires an exlusive lock -> and does this: -> -> Aquire an exclusive lock. -> Walk all the deleted data marking it as current. -> Truncate the table. -> Release the lock. -> -> Since the data is still marked invalid (right?) even if valid data -> is copied into the space it should be ignored as long as there's no -> transaction occurring that started before the data was invalidated. - -Yes, but nothing prevents newer transactions from modifying the _origin_ side of -the copied data _after_ it was copied, but before the Lock-Walk-Truncate-Unlock -cycle takes place, and so it seems unsafe. Maybe locking each record before -copying it up ... - -Regards, -Haroldo. - --- -----------------------+------------------------ - Haroldo Stenger | hstenger@ieee.org - Montevideo, Uruguay. 
| hstenger@adinet.com.uy -----------------------+------------------------ - Visit UYLUG Web Site: http://www.linux.org.uy ------------------------------------------------ - -From pgsql-hackers-owner+M5917@hub.org Fri Aug 18 09:41:33 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id JAA05170 - for ; Fri, 18 Aug 2000 09:41:33 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7IDVjm75143; - Fri, 18 Aug 2000 09:31:46 -0400 (EDT) -Received: from andie.ip23.net (andie.ip23.net [212.83.32.23]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7IDPIm73296 - for ; Fri, 18 Aug 2000 09:25:18 -0400 (EDT) -Received: from imap1.ip23.net (imap1.ip23.net [212.83.32.35]) - by andie.ip23.net (8.9.3/8.9.3) with ESMTP id PAA58387; - Fri, 18 Aug 2000 15:25:12 +0200 (CEST) -Received: from ip23.net (spc.ip23.net [212.83.32.122]) - by imap1.ip23.net (8.9.3/8.9.3) with ESMTP id PAA59177; - Fri, 18 Aug 2000 15:41:28 +0200 (CEST) -Message-ID: <399D3938.582FDB49@ip23.net> -Date: Fri, 18 Aug 2000 15:25:12 +0200 -From: Sevo Stille -Organization: IP23 -X-Mailer: Mozilla 4.61 [en] (X11; I; Linux 2.2.10 i686) -X-Accept-Language: en, de -MIME-Version: 1.0 -To: Alfred Perlstein -CC: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] VACUUM optimization ideas. -References: <20000817170118.K4854@fw.wintelcom.net> -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: RO - -Alfred Perlstein wrote: - -> The idea is that since rows are marked deleted it's ok for the -> vacuum to fill them with data from the tail of the table as -> long as no transaction is in progress that has started before -> the row was deleted. - -Well, isn't one of the advantages of vacuuming in the reordering it -does? 
With a "fill deleted chunks" logic, we'd have far less order in -the databases. - -> This may allow the vacuum process to copyback all the data without -> a lock, - -Nope. Another process might update the values in between move and mark, -if the record is not locked. We'd either have to write-lock the entire -table for that period, write lock every item as it is moved, or lock, -move and mark on a per-record basis. The latter would be slow, but it -could be done in a permanent low priority background process, utilizing -empty CPU cycles. Besides, it could probably be done not only by simply -filling from the tail, but also by moving the records up in a sorted -fashion. - -> #2 -> -> Reducing the amount of scanning a vaccum must do: -> -> It would make sense that if a value of the earliest deleted chunk -> was kept in a table then vacuum would not have to scan the entire -> table in order to work, it would only need to start at the 'earliest' -> invalidated row. - -Trivial to do. But of course #1 may imply that the physical ordering is -even less likely to be related to the logical ordering in a way where -this helps. - -> The utility of this (at least for us) is that we have several tables -> that will grow to hundreds of megabytes, however changes will only -> happen at the tail end (recently added rows). - -The tail is a relative position - except for the case where you add -temporary records to a constant default set, everything in the tail will -move, at least relatively, to the head after some time. - -> If we could reduce the -> amount of time spent in a vacuum state it would help us a lot. - -Rather: If we can reduce the time spent in a locked state while -vacuuming, it would help a lot. Being in a vacuum is not the issue - -even permanent vacuuming need not be an issue, if the locks it uses are -suitably short-lived.
- -Sevo - --- -sevo@ip23.net - -From pgsql-hackers-owner+M5911@hub.org Thu Aug 17 21:11:20 2000 -Received: from hub.org (root@hub.org [216.126.84.1]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id VAA01882 - for ; Thu, 17 Aug 2000 21:11:20 -0400 (EDT) -Received: from hub.org (majordom@localhost [127.0.0.1]) - by hub.org (8.10.1/8.10.1) with SMTP id e7I119m80626; - Thu, 17 Aug 2000 21:01:09 -0400 (EDT) -Received: from acheron.rime.com.au (root@albatr.lnk.telstra.net [139.130.54.222]) - by hub.org (8.10.1/8.10.1) with ESMTP id e7I0wMm79870 - for ; Thu, 17 Aug 2000 20:58:22 -0400 (EDT) -Received: from oberon (Oberon.rime.com.au [203.8.195.100]) - by acheron.rime.com.au (8.9.3/8.9.3) with SMTP id KAA03215; - Fri, 18 Aug 2000 10:58:25 +1000 -Message-Id: <3.0.5.32.20000818105835.0280ade0@mail.rhyme.com.au> -X-Sender: pjw@mail.rhyme.com.au -X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32) -Date: Fri, 18 Aug 2000 10:58:35 +1000 -To: Chris Bitmead , - Ben Adida -From: Philip Warner -Subject: Re: [HACKERS] Inserting a select statement result into another - table -Cc: Andrew Selle , pgsql-hackers@postgresql.org -In-Reply-To: <399C7689.2DDDAD1D@nimrod.itg.telecom.com.au> -References: <20000817130517.A10909@upl.cs.wisc.edu> - <399BF555.43FB70C8@openforce.net> -Mime-Version: 1.0 -Content-Type: text/plain; charset="us-ascii" -X-Mailing-List: pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@hub.org -Status: O - -At 09:34 18/08/00 +1000, Chris Bitmead wrote: -> ->He does ask a legitimate question though. If you are going to have a ->LIMIT feature (which of course is not pure SQL), there seems no reason ->you shouldn't be able to insert the result into a table. - -This feature is supported by two commercial DBs: Dec/RDB and SQL/Server. I -have no idea if Oracle supports it, but it is such a *useful* feature that -I would be very surprised if it didn't. 
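The INSERT ... SELECT ... ORDER BY ... LIMIT construct endorsed above can be tried end to end. Below is a sketch using Python's sqlite3 module, since SQLite accepts the same shape of statement; the tasks/todolist schema and values are a minimal stand-in modeled on Andrew Selle's example quoted below, not his actual tables.

```python
import sqlite3

# Minimal stand-in schema for the tasks/todolist example in this thread.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tasks (taskid INTEGER, done TEXT, submit INTEGER)")
con.execute("CREATE TABLE todolist (taskid INTEGER, userid TEXT, claimed INTEGER)")
con.executemany("INSERT INTO tasks VALUES (?, 'f', ?)",
                [(1, 100), (2, 200), (3, 300), (4, 400)])

# Claim the two most recently submitted unfinished tasks for user '1':
# only the first two rows of the ordered SELECT are inserted.
con.execute("""
    INSERT INTO todolist
    SELECT taskid, '1', strftime('%s', 'now')
    FROM tasks
    WHERE done = 'f'
    ORDER BY submit DESC
    LIMIT 2
""")
claimed = sorted(row[0] for row in con.execute("SELECT taskid FROM todolist"))
print(claimed)  # [3, 4]: the two newest tasks
```

The LIMIT applies to the SELECT feeding the insert, so exactly two rows land in todolist, which is the behavior the thread is discussing.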
- ->Ben Adida wrote: ->> ->> What is the purpose you're trying to accomplish with this order by? No -matter what, all the ->> rows where done='f' will be inserted, and you will not be left with any -indication of that ->> order once the rows are in the todolist table. - -I don't know what his *purpose* was, but the query should only insert the -first two rows from the select (because of the LIMIT). - ->> Andrew Selle wrote: ->> ->> > Alright. My situation is this. I have a list of things that need to -be done ->> > in a table called tasks. I have a list of users who will complete -these tasks. ->> > I want these users to be able to come in and "claim" the top 2 most -recent tasks ->> > that have been added. These tasks then get stored in a table called -todolist ->> > which stores who claimed the task, the taskid, and when the task was -claimed. ->> > For each time someone wants to claim some number of tasks, I want to -do something ->> > like ->> > ->> > INSERT INTO todolist ->> > SELECT taskid,'1',now() ->> > FROM tasks ->> > WHERE done='f' ->> > ORDER BY submit DESC ->> > LIMIT 2; - ----------------------------------------------------------------- -Philip Warner | __---_____ -Albatross Consulting Pty. Ltd. |----/ - \ -(A.B.N.
75 008 659 498) | /(@) ______---_ -Tel: (+61) 0500 83 82 81 | _________ \ -Fax: (+61) 0500 83 82 82 | ___________ | -Http://www.rhyme.com.au | / \| - | --________-- -PGP key available upon request, | / -and from pgp5.ai.mit.edu:11371 |/ - -From pgsql-hackers-owner+M29308@postgresql.org Mon Sep 23 09:47:54 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8NDlqd00289 - for ; Mon, 23 Sep 2002 09:47:53 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id 7CA64476497; Mon, 23 Sep 2002 09:43:28 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id EDA70475BC3; Mon, 23 Sep 2002 09:43:20 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 85264476479 - for ; Mon, 23 Sep 2002 09:43:15 -0400 (EDT) -Received: from www.pspl.co.in (www.pspl.co.in [202.54.11.65]) - by postgresql.org (Postfix) with ESMTP id C7899476477 - for ; Mon, 23 Sep 2002 09:43:12 -0400 (EDT) -Received: (from root@localhost) - by www.pspl.co.in (8.11.6/8.11.6) id g8NDiQ030526 - for ; Mon, 23 Sep 2002 19:14:26 +0530 -Received: from daithan (daithan.intranet.pspl.co.in [192.168.7.161]) - by www.pspl.co.in (8.11.6/8.11.0) with ESMTP id g8NDiQ330521; - Mon, 23 Sep 2002 19:14:26 +0530 -From: "Shridhar Daithankar" -To: pgsql-hackers@postgresql.org, pgsql-general@postgresql.org -Date: Mon, 23 Sep 2002 19:13:44 +0530 -MIME-Version: 1.0 -Subject: [HACKERS] Postgresql Automatic vacuum -Reply-To: shridhar_daithankar@persistent.co.in -Message-ID: <3D8F67E8.7500.4E0E180@localhost> -X-Mailer: Pegasus Mail for Windows (v4.02) -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7BIT -Content-Description: Mail message body -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org 
-X-Virus-Scanned: by AMaViS new-20020517 -Status: OR - -Hello All, - -I have written a small daemon that can automatically vacuum PostgreSQL -database, depending upon activity per table. - -It sits on top of postgres statistics collector. The postgres installation -should have per row statistics collection enabled. - -Features are, - -* Vacuuming based on activity on the table -* Per table vacuum. So only heavily updated tables are vacuumed. -* multiple databases supported -* Performs 'vacuum analyze' only, so it will not block the database - - -The project location is -http://gborg.postgresql.org/project/pgavd/projdisplay.php - -Let me know for bugs/improvements and comments.. - -I am sure real world postgres installations has some sort of scripts doing -similar thing. This is an attempt to provide a generic interface to periodic -vacuum. - - -Bye - Shridhar - --- -The Abrams' Principle: The shortest distance between two points is off the -wall. - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-hackers-owner+M29344@postgresql.org Tue Sep 24 02:42:36 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8O6gYg19416 - for ; Tue, 24 Sep 2002 02:42:35 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id 128704762AF; Tue, 24 Sep 2002 02:42:36 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id DE80C4760F5; Tue, 24 Sep 2002 02:42:32 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 40A8A475DBC - for ; Tue, 24 Sep 2002 02:42:25 -0400 (EDT) -Received: from relay.icomedias.com 
(relay.icomedias.com [62.99.232.66]) - by postgresql.org (Postfix) with ESMTP id 7ECC8475DAD - for ; Tue, 24 Sep 2002 02:42:23 -0400 (EDT) -Received: from loki ([10.192.17.128]) - by relay.icomedias.com (8.12.5/8.12.5) with ESMTP id g8O6g8BX014226; - Tue, 24 Sep 2002 08:42:09 +0200 -Content-Type: text/plain; - charset="iso-8859-1" -From: Mario Weilguni -To: shridhar_daithankar@persistent.co.in, matthew@zeut.net -Subject: Re: [HACKERS] Postgresql Automatic vacuum -Date: Tue, 24 Sep 2002 08:42:06 +0200 -User-Agent: KMail/1.4.3 -cc: pgsql-hackers@postgresql.org -References: <3D8F67E8.7500.4E0E180@localhost> <3D9050B2.9782.86E55C0@localhost> -In-Reply-To: <3D9050B2.9782.86E55C0@localhost> -MIME-Version: 1.0 -Message-ID: <200209240842.06459.mweilguni@sime.com> -avpresult: 0, ok, ok -X-Scanned-By: MIMEDefang 2.16 (www . roaringpenguin . com / mimedefang) -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Content-Transfer-Encoding: 8bit -X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id g8O6gYg19416 -Status: OR - -Am Dienstag, 24. September 2002 08:16 schrieb Shridhar Daithankar: -> -> > I will play with it more and give you some more feedback. -> -> Awaiting that. -> - -IMO there are still several problems with that approach, namely: -* every database will get "polluted" with the autovacuum table, which is undesired -* the biggest problem is the ~/.pgavrc file. I think it should work like other postgres utils do, e.g. supporting -U, -d, .... -* it's not possible to use without activly administration the config file. it should be able to work without - adminstrator assistance. - -When this is a daemon, why not store the data in memory? Even with several thousands of tables the memory footprint would - still be small. And it should be possible to use for all databases without modifying a config file. 
- -Two weeks ago I began writing a similar daemon, but had no time yet to finish it. I've tried to avoid using fixed numbers (namely "vacuum table -after 1000 updates") and tried to make my own heuristic based on the statistics data and the size of the table. The reason is, for a large table 1000 entries might be -a small percentage and vacuum is not necessary, while for small tables 10 updates might be sufficient. - -Best regards, - Mario Weilguni - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M29345@postgresql.org Tue Sep 24 03:02:50 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g8O72lg21051 - for ; Tue, 24 Sep 2002 03:02:48 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP - id 9B3EA4762F6; Tue, 24 Sep 2002 03:02:48 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 902EA476020; Tue, 24 Sep 2002 03:02:45 -0400 (EDT) -Received: from localhost (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with ESMTP id 98689475DAD - for ; Tue, 24 Sep 2002 03:02:18 -0400 (EDT) -Received: from www.pspl.co.in (www.pspl.co.in [202.54.11.65]) - by postgresql.org (Postfix) with ESMTP id 47B8647592C - for ; Tue, 24 Sep 2002 03:02:16 -0400 (EDT) -Received: (from root@localhost) - by www.pspl.co.in (8.11.6/8.11.6) id g8O73QQ16318 - for ; Tue, 24 Sep 2002 12:33:26 +0530 -Received: from daithan (daithan.intranet.pspl.co.in [192.168.7.161]) - by www.pspl.co.in (8.11.6/8.11.0) with ESMTP id g8O73Q316313 - for ; Tue, 24 Sep 2002 12:33:26 +0530 -From: "Shridhar Daithankar" -To: pgsql-hackers@postgresql.org -Date: Tue, 24 Sep 2002 12:32:43 +0530 -MIME-Version: 1.0 -Subject: 
Re: [HACKERS] Postgresql Automatic vacuum -Reply-To: shridhar_daithankar@persistent.co.in -Message-ID: <3D905B6B.1635.898382A@localhost> -References: <3D9050B2.9782.86E55C0@localhost> -In-Reply-To: <200209240842.06459.mweilguni@sime.com> -X-Mailer: Pegasus Mail for Windows (v4.02) -Content-Type: text/plain; charset=US-ASCII -Content-Transfer-Encoding: 7BIT -Content-Description: Mail message body -X-Virus-Scanned: by AMaViS new-20020517 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Virus-Scanned: by AMaViS new-20020517 -Status: OR - -On 24 Sep 2002 at 8:42, Mario Weilguni wrote: - -> Am Dienstag, 24. September 2002 08:16 schrieb Shridhar Daithankar: -> IMO there are still several problems with that approach, namely: -> * every database will get "polluted" with the autovacuum table, which is undesired - -I agree. But that was the best alternative I could see. explanation -follows..Besides I didn't want to touch PG meta data.. - -> * the biggest problem is the ~/.pgavrc file. I think it should work like other postgres utils do, e.g. supporting -U, -d, .... - -Shouldn't be a problem. The config stuff is working and I can add that. I would -rather term it a minor issue. On personal preference, I would just fire it -without any arguments. It's not a thing that you change daily. Configure it in -config file and done.. - -> * it's not possible to use without activly administration the config file. it should be able to work without -> adminstrator assistance. - -Well. I would call that tuning. Each admin can tune it. Yes it's an effort but -certainly not an active administration. - -> When this is a daemon, why not store the data in memory? Even with several thousands of tables the memory footprint would -> still be small. And it should be possible to use for all databases without modifying a config file. - -Well. When postgresql has ability to deal with arbitrary number of rows, it -seemed redundant to me to duplicate all those functionality. 
Why write lists
-and arrays again and again? Let postgresql do it.
-
-
-> Two weeks ago I began writing a similar daemon, but had no time yet to finish it. I've tried to avoid using fixed numbers (namely "vacuum table
-> after 1000 updates") and tried to make my own heuristic based on the statistics data and the size of the table. The reason is, for a large table 1000 entries might be
-> a small percentage and vacuum is not necessary, while for small tables 10 updates might be sufficient.
-
-Well, that fixed number is not really fixed but admin tunable, that too per
-database. These are just defaults. Tune it to suit your needs.
-
-The objective of the whole exercise is to get rid of periodic vacuum, as this app
-shifts the threshold to activity rather than time.
-
-Besides, a table should be vacuumed when it starts affecting performance. On an
-installation, if a table has 1M rows and changing 1K rows affects performance, there
-will be a similar performance hit for changing 1K rows in a 100K-row table,
-because the overhead involved would be almost the same. (Not disk space; pgavd does
-not target vacuum full, but tuple size should matter.)
-
-At least me thinks so..
-
-I plan to implement per table thresholds in addition to per database thresholds.
-But right now, it seems like overhead to me. Besides, there is an item in TODO
-to shift the unit of work from rows to blocks affected. I guess that takes care of
-some of your points..
-Bye
- Shridhar
-
---
-Jones' Second Law: The man who smiles when things go wrong has thought of
-someone to blame it on.
-
-
----------------------------(end of broadcast)---------------------------
-TIP 5: Have you checked our extensive FAQ? 
-
-http://www.postgresql.org/users-lounge/docs/faq.html
-
diff --git a/doc/TODO.detail/yacc b/doc/TODO.detail/yacc
deleted file mode 100644
index 8e7fa8986c..0000000000
--- a/doc/TODO.detail/yacc
+++ /dev/null
@@ -1,402 +0,0 @@
-From selkovjr@mcs.anl.gov Sat Jul 25 05:31:05 1998
-Received: from renoir.op.net (root@renoir.op.net [209.152.193.4])
- by candle.pha.pa.us (8.8.5/8.8.5) with ESMTP id FAA16564
- for ; Sat, 25 Jul 1998 05:31:03 -0400 (EDT)
-Received: from antares.mcs.anl.gov (mcs.anl.gov [140.221.9.6]) by renoir.op.net (o1/$ Revision: 1.18 $) with SMTP id FAA01775 for ; Sat, 25 Jul 1998 05:28:22 -0400 (EDT)
-Received: from mcs.anl.gov (wit.mcs.anl.gov [140.221.5.148]) by antares.mcs.anl.gov (8.6.10/8.6.10) with ESMTP
- id EAA28698 for ; Sat, 25 Jul 1998 04:27:05 -0500
-Sender: selkovjr@mcs.anl.gov
-Message-ID: <35B9968D.21CF60A2@mcs.anl.gov>
-Date: Sat, 25 Jul 1998 08:25:49 +0000
-From: "Gene Selkov, Jr."
-Organization: MCS, Argonne Natl. Lab
-X-Mailer: Mozilla 4.03 [en] (X11; I; Linux 2.0.32 i586)
-MIME-Version: 1.0
-To: Bruce Momjian
-Subject: position-aware scanners
-References: <199807250524.BAA07296@candle.pha.pa.us>
-Content-Type: text/plain; charset=us-ascii
-Content-Transfer-Encoding: 7bit
-Status: RO
-
-Bruce,
-
-I attached here (through the web links) a couple of examples, totally
-irrelevant to postgres but good enough to discuss token locations. I
-might as well try to patch the backend parser, though not sure how soon.
-
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-1.
-
-The first c parser I wrote,
-http://wit.mcs.anl.gov/~selkovjr/unit-troff.tgz, is not very
-sophisticated, so token locations reported by yyerror() may be slightly
-incorrect (+/- one position, depending on the existence and type of the
-lookahead token). It is a filter used to typeset the units of measurement
-with eqn. To use it, unpack the tar file and run make. 
The Makefile is -not too generic but I built it on various systems including linux, -freebsd and sunos 4.3. The invocation can be something like this: - -./check 0 parse "l**3/(mmoll*min)" -parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or -`'('' - -l**3/(mmoll*min) - ^^^^^ - -Now to the guts. As far as I can imagine, the only way to consistently -keep track of each character read by the scanner (regardless of the -length of expressions it will match) is to redefine its YY_INPUT like -this: - -#undef YY_INPUT -#define YY_INPUT(buf,result,max_size) \ -{ \ - int c = (int) buffer[pos++]; \ - result = (c == '\0') ? YY_NULL : (buf[0] = c, 1); \ -} - -Here, buffer is the pointer to the origin of the string being scanned -and pos is a global variable, similar in usage to a file pointer (you -can both read and manipulate it at will). The buffer and the pointer are -initialized by the function - -void setString(char *s) -{ - buffer = s; - pos = 0; -} - -each time the new string is to be parsed. This (exportable) function is -part of the interface. - -In this simplistic design, yyerror() is part of the scanner module and -it uses the pos variable to report the location of unexpected tokens. -The downside of such arrangement is that in case of error condition, you -can't easily tell whether your context is current or lookahead token, it -just reports the position of the last token read (be it $ (end of -buffer) or something else): - -./check 0 convert "mol/foo" -parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or -`'('' - -mol/foo - ^^^ - -(should be at the beginning of "foo") - -./check 0 convert "mmol//l" -parse error, expecting `BASIC_UNIT' or `INTEGER' or `POSITIVE_NUMBER' or -`'('' - -mmol//l - ^ - -(should be at the second '/') - - -I believe this is why most simple parsers made with yacc would report -parse errors being "at or near" some token, which is fair enough if the -expression is not too complex. 
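The buffer-plus-offset trick above is easy to demonstrate outside of flex as well. The following is a minimal, self-contained C sketch, not taken from the unit parser: `setString` matches the interface named in the mail, while `next_token` and `report_error` are illustrative stand-ins for the generated scanner and yyerror(). It reads characters the same way the redefined YY_INPUT does (from an in-memory buffer, advancing a global `pos`) and can print the caret display shown in the error messages.

```c
#include <stdio.h>
#include <ctype.h>

static const char *buffer;  /* origin of the string being scanned */
static int pos;             /* the "file pointer" into the buffer */

/* Install a new string to parse, as in the mail's interface. */
void setString(const char *s)
{
    buffer = s;
    pos = 0;
}

/* Hand out one character, the way the redefined YY_INPUT would. */
static int next_char(void)
{
    int c = (unsigned char) buffer[pos];
    if (c == '\0')
        return EOF;         /* stop at the terminator; don't run past it */
    pos++;
    return c;
}

/* Scan one whitespace-delimited token and record where it started.
 * Returns 1 if a token was found, 0 at end of buffer. */
int next_token(char *out, int outsz, int *start)
{
    int c, n = 0;

    do {
        c = next_char();
    } while (c != EOF && isspace(c));
    if (c == EOF)
        return 0;
    *start = pos - 1;       /* offset of the token's first character */
    while (c != EOF && !isspace(c)) {
        if (n < outsz - 1)
            out[n++] = (char) c;
        c = next_char();
    }
    out[n] = '\0';
    return 1;
}

/* Reproduce the caret display from the example error messages. */
void report_error(int start, int len)
{
    int i;

    printf("%s\n", buffer);
    for (i = 0; i < start; i++)
        putchar(' ');
    for (i = 0; i < len; i++)
        putchar('^');
    putchar('\n');
}
```

Recording the start offset per token, rather than consulting `pos` only at error time, is what removes the off-by-one-token behavior shown in the "mol/foo" example.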
- - -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -2. The second version of the same scanner, -http://wit.mcs.anl.gov/~selkovjr/scanner-example.tgz, addresses this -problem by recording exact locations of the tokens in each instance of -the token semantic data structure. The global, - -UNIT_YYSTYPE unit_yylval; - -would be normally used to export the token semantics (including its -original or modified text and location data) to the parser. -Unfortunately, I cannot show you the parser part in c, because that's -about when I stopped writing parsers in c. Instead, I included a small -test program, test.c, that mimics the parser's expectations for the -scanner data pretty well. I am assuming here that you are not interested -in digging someone else's ugly guts for relatively small bit of -information; let me know if I am wrong and I will send you the complete -perl code (also generated with bison). - -To run this example, unpack the tar file and run Make. Then do - - gcc test.c scanner.o - -and run a.out - -Note the line - - yylval = unit_getyylval(); - -in test.c. You will not normally need it in a c parser. It is enough to -define yylval as an external variable and link it to yylval in yylex() - -In the bison-generated parser, yylval gets pushed into a stack (pointed -to by yylsp) each time a new token is read. For each syntax rule, the -bison macros @1, @2, ... are just shortcuts to locations in the stack 1, -2, ... levels deep. 
In the following code fragment, @3 refers to the
-location info for the third term in the rule (INTEGER):
-
-(sorry about perl, but I think you can do the same things in c without
-significant changes to your existing parser)
-
-term:   base {
-            $$ = $1;
-            $$->{'order'} = 1;
-        }
-        | base EXP INTEGER {
-            $$ = $1;
-            $$->{'order'} = @3->{'text'};
-            $$->{'scale'} = $$->{'scale'} ** $$->{'order'};
-            if ( $$->{'order'} == 0 ) {
-                yyerror("Error: expecting a non-zero integer exponent");
-                YYERROR;
-            }
-        }
-
-
-which translates to:
-
-    ($yyn == 10) && do {
-        $yyval = $yyvsa[-1];
-        $yyval->{'order'} = 1;
-        last SWITCH;
-    };
-
-    ($yyn == 11) && do {
-        $yyval = $yyvsa[-3];
-        $yyval->{'order'} = $yylsa[-1]->{'text'};
-        $yyval->{'scale'} = $yyval->{'scale'} ** $yyval->{'order'};
-        if ( $yyval->{'order'} == 0 ) {
-            yyerror("Error: expecting a non-zero integer exponent");
-            goto yyerrlab1 ;
-        }
-        last SWITCH;
-    };
-
-In c, you will have a bit more complicated pointer arithmetic to address
-the stack, but the usage of objects will be the same. Note here that it
-is convenient to keep all information about the token in its location
-info (yylsa, yylsp, yylval, @n), while everything relating to the value
-of the expression, or to the parse tree, is better placed in the
-semantic stack (yyssa, yyssp, yysval, $n). Also note that in some cases
-you can do semantic checks inside rules and report useful messages
-before or instead of invoking yyerror().
-
-Finally, it is useful to make the following wrapper function around
-external yylex() in order to maintain your own token stack. Unlike the
-parser's internal stack, which is only as deep as the rule being reduced,
-this one can hold all tokens recognized during the current run, and that
-can be extremely helpful for error reporting and any transformations you
-may need. 
In this way, you can even scan (tokenize) the whole buffer -before handing it off to the parser (who knows, you may need a token -ahead of what is currently seen by the parser): - - -sub tokenize { - undef @tokenTable; - my ($tok, $text, $name, $unit, $first_line, $first_column, -$last_line, $last_column); - - while ( ($tok = &UnitLex::yylex()) > 0 ) { # this is where the -c-coded yylex is called, - # UnitLex is the perl -extension encapsulating it - ( $text, $name, $unit, $first_line, $first_column, $last_line, -$last_column ) = &UnitLex::getyylval; - push(@tokenTable, - Unit::yyltype->new ( - 'token' => $tok, - 'text' => $text, - 'name' => $name, - 'unit' => $unit, - 'first_line' => $first_line, - 'first_column' => $first_column, - 'last_line' => $last_line, - 'last_column' => $last_column, - ) - ) - } - -} - - -It is now a lot easier to handle various state-related problems, such as -backtracking and error reporting. The yylex() function as seen by the -parser might be constructed somewhat like this: - -sub yylex { - $yylloc = $tokenTable[$tokenNo]; # $tokenNo is a global; now -instead of a "file pointer", - # as in the first example, we have -a "token pointer" - undef $yylval; - - - # disregard this; name this block "computing semantic values" - if ( $yylloc->{'token'} == UNIT) { - $yylval = Unit::Operand->new( - 'unit' => Unit::Dict::unit($yylloc->{'unit'}), - 'base' => Unit::Dict::base($yylloc->{'unit'}), - 'scale' => Unit::Dict::scale($yylloc->{'unit'}), - 'scaleToBase' => Unit::Dict::scaleToBase($yylloc->{'unit'}), - 'loc' => $yylloc, - ); - } - elsif ( ($yylloc->{'token'} == INTEGER ) || ($yylloc->{'token'} == -POSITIVE_NUMBER) ) { - $yylval = Unit::Operand->new( - 'unit' => '1', - 'base' => '1', - 'scale' => 1, - 'scaleToBase' => 1, - 'loc' => $yylloc, - ); - } - - $tokenNo++; - return(%{$yylloc}->{'token'}); # This is all the parser needs to -know about this token. - # But we already made sure we saved -everything we need to know. 
-}
-
-
-Now the most interesting part, the error reporting routine:
-
-
-sub yyerror {
-    my ($str) = @_;
-    my ($message, $start, $end, $loc);
-
-    $loc = $tokenTable[$tokenNo-1];   # This is the same as to say,
-                                      # "obtain the location info for the current token"
-
-    # You may use this routine for your own purposes or let the parser use it
-    if( $str ne 'parse error' ) {
-        $message = "$str instead of `" . $loc->{'name'} . "' <" .
-            $loc->{'text'} . ">, at line " . $loc->{'first_line'} . ":\n\n";
-    }
-    else {
-        $message = "unexpected token `" . $loc->{'name'} . "' <" .
-            $loc->{'text'} . ">, at line " . $loc->{'first_line'} . ":\n\n";
-    }
-
-    $message .= $parseBuffer . "\n";  # that's the original string that
-                                      # was used to set the parser buffer
-
-    $message .= ( ' ' x ($loc->{'first_column'} + 1) ) . ( '^' x length($loc->{'text'}) ) . "\n";
-    if( $str ne 'parse error' ) {
-        print STDERR "$str instead of `", $loc->{'name'}, "' {",
-            $loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n";
-    }
-    else {
-        print STDERR "unexpected token `", $loc->{'name'}, "' {",
-            $loc->{'text'}, "}, at line ", $loc->{'first_line'}, ":\n\n";
-    }
-
-    print STDERR "$parseBuffer\n";
-    print STDERR ' ' x ($loc->{'first_column'} + 1), '^' x length($loc->{'text'}), "\n";
-}
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Scanners used in these examples assume there is a single line of text on
-the input (the first_line and last_line elements of yylloc are simply
-ignored). If you want to be able to parse multi-line buffers, just add a
-lex rule for '\n' that will increment the line count and reset the pos
-variable to zero.
-
-
-Ugly as it may seem, I find this approach extremely liberating. If the
-grammar becomes too complicated for a LALR(1) parser, I can cascade
The token table can then be used to reassemble parts -of original expression for subordinate parsers, preserving the location -info all the way down, so that subordinate parsers can report their -problems consistently. You probably don't need this, as SQL is very well -thought of and has parsable grammar. But it may be of some help, for -error reporting. - - ---Gene - -From pgsql-patches-owner+M1499@postgresql.org Sat Aug 4 13:11:53 2001 -Return-path: -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.10.1/8.10.1) with ESMTP id f74HBrh11339 - for ; Sat, 4 Aug 2001 13:11:53 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by postgresql.org (8.11.3/8.11.4) with SMTP id f74H89655183; - Sat, 4 Aug 2001 13:08:09 -0400 (EDT) - (envelope-from pgsql-patches-owner+M1499@postgresql.org) -Received: from sss.pgh.pa.us ([192.204.191.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id f74Gxb653074 - for ; Sat, 4 Aug 2001 12:59:37 -0400 (EDT) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f74GtPC29183; - Sat, 4 Aug 2001 12:55:25 -0400 (EDT) -To: Dave Page -cc: "'Fernando Nasser'" , - Bruce Momjian , Neil Padgett , - pgsql-patches@postgresql.org -Subject: Re: [PATCHES] Patch for Improved Syntax Error Reporting -In-Reply-To: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk> -References: <8568FC767B4AD311AC33006097BCD3D61A2D70@woody.vale-housing.co.uk> -Comments: In-reply-to Dave Page - message dated "Sat, 04 Aug 2001 12:37:23 +0100" -Date: Sat, 04 Aug 2001 12:55:24 -0400 -Message-ID: <29180.996944124@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-patches-owner@postgresql.org -Status: OR - -Dave Page writes: -> Oh, I quite agree. I'm not adverse to updating my code, I just want to avoid -> users getting misleading messages until I come up with those updates. 
- -Hmm ... if they were actively misleading then I'd share your concern. - -I guess what you're thinking is that the error offset reported by the -backend won't correspond directly to what the user typed, and if the -user tries to use the offset to manually count off characters, he may -arrive at the wrong place? Good point. I'm not sure whether a message -like - - ERROR: parser: parse error at or near 'frum'; - POSITION: 42 - -would be likely to encourage people to try that. Thoughts? (I do think -this is a good argument for not embedding the position straight into the -main error message though...) - -One possible compromise is to combine the straight character-offset -approach with a simplistic context display: - - ERROR: parser: parse error at or near 'frum'; - POSITION: 42 ... oid,relname FRUM ... - -The idea is to define the "POSITION" field as an integer offset possibly -followed by whitespace and noise words. An updated client would grab -the offset, ignore the rest of the field, and do the right thing. A -not-updated client would display the entire message, and with any luck -the user would read it correctly. - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html -
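Tom's proposed POSITION field (an integer offset possibly followed by whitespace and noise words) is cheap for an updated client to honor: take the leading integer with strtol(), which stops at the first character that is not part of a number, and ignore the rest. A small client-side sketch; `position_offset` is our illustrative name, not taken from any real client code.

```c
#include <stdlib.h>

/* Extract the integer offset from a POSITION field such as
 * "42 ... oid,relname FRUM ...".  Trailing noise words are ignored,
 * per the proposed field definition.  Returns -1 if the field does
 * not begin with an integer. */
long position_offset(const char *field)
{
    char *end;
    long off = strtol(field, &end, 10);

    if (end == field)   /* no leading integer at all */
        return -1;
    return off;
}
```

A not-updated client simply displays the whole field, noise words and all, which is exactly the fallback behavior the message describes.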