Add savepoints discussion.

author Bruce Momjian <bruce@momjian.us>

Wed, 20 Feb 2002 00:44:30 +0000 (00:44 +0000)

committer Bruce Momjian <bruce@momjian.us>

Wed, 20 Feb 2002 00:44:30 +0000 (00:44 +0000)
diff --git a/doc/TODO.detail/transactions b/doc/TODO.detail/transactions

index 8898580bc2e853aa0e5e1fc93cc83923bea2aeed..ce7af3e4b3fdafd93c1533e1b56cfeaaf17e419a 100644 (file)
--- a/doc/TODO.detail/transactions
+++ b/doc/TODO.detail/transactions
@@ -167,3 +167,1011 @@ http://groups.google.com/groups?hl=en&threadm=200108050432.f754Wdo11696%40candle
  Regards,
  Haroldo.
  
+From vmikheev@SECTORBASE.COM Wed Jan 23 18:23:04 2002
+Return-path: <vmikheev@SECTORBASE.COM>
+Received: from sectorbase2.sectorbase.com ([66.106.163.120])
+       by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g0NNN3U21442
+       for <pgman@candle.pha.pa.us>; Wed, 23 Jan 2002 18:23:04 -0500 (EST)
+Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
+       id <DKXVZ14S>; Wed, 23 Jan 2002 15:22:52 -0800
+Message-ID: <3705826352029646A3E91C53F7189E32518483@sectorbase2.sectorbase.com>
+From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>,
+   PostgreSQL-development
+  <pgsql-hackers@postgreSQL.org>
+Subject: RE: [HACKERS] Savepoints
+Date: Wed, 23 Jan 2002 15:22:42 -0800
+MIME-Version: 1.0
+X-Mailer: Internet Mail Service (5.5.2653.19)
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Status: ORr
+
+> I have talked in the past about a possible implementation of
+> savepoints/nested transactions.  I would like to more formally outline
+> my ideas below.
+
+Well, I would like to do the same -:)
+
+> ...
+> There is no reason for other backend to be able to see savepoint undo
+> information, and keeping it private greatly simplifies the
+> implementation.
+
+Yes... and requires additional memory/disk space: we keep old records
+in data files and we'll store them again...
+
+How about: use overwriting smgr + put old records into rollback
+segments - RS - (you have to keep them somewhere till TX's running
+anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+changes and WAL will be used for RS/data files recovery).
+Something like what Oracle does.
+
+Vadim
+
+From pgsql-hackers-owner+M18085=candle.pha.pa.us=pgman@postgresql.org Wed Jan 23 20:15:02 2002
+Return-path: <pgsql-hackers-owner+M18085=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0O1F1U26461
+       for <pgman@candle.pha.pa.us>; Wed, 23 Jan 2002 20:15:02 -0500 (EST)
+Received: (qmail 92866 invoked by alias); 24 Jan 2002 01:14:59 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 24 Jan 2002 01:14:59 -0000
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0O18ml91949
+       for <pgsql-hackers@postgresql.org>; Wed, 23 Jan 2002 20:08:50 -0500 (EST)
+       (envelope-from pgman@candle.pha.pa.us)
+Received: (from pgman@localhost)
+       by candle.pha.pa.us (8.11.6/8.10.1) id g0O18jV26044;
+       Wed, 23 Jan 2002 20:08:45 -0500 (EST)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200201240108.g0O18jV26044@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+In-Reply-To: <3705826352029646A3E91C53F7189E32518483@sectorbase2.sectorbase.com>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+Date: Wed, 23 Jan 2002 20:08:45 -0500 (EST)
+cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL96 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+Mikheev, Vadim wrote:
+> > I have talked in the past about a possible implementation of
+> > savepoints/nested transactions.  I would like to more formally outline
+> > my ideas below.
+> 
+> Well, I would like to do the same -:)
+
+Good.
+
+> > ...
+> > There is no reason for other backend to be able to see savepoint undo
+> > information, and keeping it private greatly simplifies the
+> > implementation.
+> 
+> Yes... and requires additional memory/disk space: we keep old records
+> in data files and we'll store them again...
+
+I was suggesting keeping only relid/tid or in some cases only relid. 
+Seems like one or the other will fit all needs:  relid/tid for update of
+a few rows, relid for many rows updated in the same table.  I saw no
+need to store the actual data.
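
A minimal sketch of the relid/tid bookkeeping described above, not taken from any PostgreSQL source. All names here (UndoEntry, UndoList, undo_record, UNDO_TID_LIMIT, UNDO_MAX_ENTRIES) are hypothetical, and the threshold of four tids per relation is an arbitrary assumption. It records one tid per modified row and, once a relation has accumulated too many tids, collapses them into a single relid-only entry meaning "scan the whole relation on rollback".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_MAX_ENTRIES 64 /* hypothetical capacity of the private list */
#define UNDO_TID_LIMIT   4  /* hypothetical: more tids than this collapses to relid */

typedef struct {
    uint32_t relid;     /* relation that was modified */
    bool     whole_rel; /* true: relid-only entry, tid fields unused */
    uint32_t block;     /* tid, part 1: block number */
    uint16_t offset;    /* tid, part 2: line pointer offset */
} UndoEntry;

typedef struct {
    UndoEntry entries[UNDO_MAX_ENTRIES];
    size_t    count;
} UndoList;

/* Record one modified row. Returns false only if the list is full. */
static bool undo_record(UndoList *list, uint32_t relid,
                        uint32_t block, uint16_t offset)
{
    assert(list != NULL);

    size_t tids = 0;
    for (size_t i = 0; i < list->count; i++) {
        if (list->entries[i].relid != relid)
            continue;
        if (list->entries[i].whole_rel)
            return true; /* relation already covered by a relid-only entry */
        tids++;
    }

    if (tids >= UNDO_TID_LIMIT) {
        /* Many rows touched in one table: drop its tids, keep the relid. */
        size_t out = 0;
        for (size_t i = 0; i < list->count; i++)
            if (list->entries[i].relid != relid)
                list->entries[out++] = list->entries[i];
        list->count = out;
        list->entries[list->count++] = (UndoEntry){ relid, true, 0, 0 };
        return true;
    }

    if (list->count == UNDO_MAX_ENTRIES)
        return false;
    list->entries[list->count++] = (UndoEntry){ relid, false, block, offset };
    return true;
}
```

Because the list holds only identifiers and never row data, it can stay private to one backend, which is the point made in the message.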
+
+> How about: use overwriting smgr + put old records into rollback
+> segments - RS - (you have to keep them somewhere till TX's running
+> anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> changes and WAL will be used for RS/data files recovery).
+> Something like what Oracle does.
+
+Why record the old data rows rather than the tids?  While the
+transaction is running, the rows can't be moved anyway.  Also, why store
+them in a shared area?  That has additional requirements because one old
+transaction can require all transactions to keep their stuff around. 
+Why not just make it a private data file for each backend?
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+
+From pgsql-hackers-owner+M18086=candle.pha.pa.us=pgman@postgresql.org Wed Jan 23 20:25:47 2002
+Return-path: <pgsql-hackers-owner+M18086=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0O1PkU26964
+       for <pgman@candle.pha.pa.us>; Wed, 23 Jan 2002 20:25:47 -0500 (EST)
+Received: (qmail 94878 invoked by alias); 24 Jan 2002 01:25:44 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 24 Jan 2002 01:25:44 -0000
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0O1L1l94075
+       for <pgsql-hackers@postgreSQL.org>; Wed, 23 Jan 2002 20:21:01 -0500 (EST)
+       (envelope-from pgman@candle.pha.pa.us)
+Received: (from pgman@localhost)
+       by candle.pha.pa.us (8.11.6/8.10.1) id g0O1Kwm26748;
+       Wed, 23 Jan 2002 20:20:58 -0500 (EST)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200201240120.g0O1Kwm26748@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+In-Reply-To: <3705826352029646A3E91C53F7189E32518483@sectorbase2.sectorbase.com>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+Date: Wed, 23 Jan 2002 20:20:58 -0500 (EST)
+cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL96 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+> > There is no reason for other backend to be able to see savepoint undo
+> > information, and keeping it private greatly simplifies the
+> > implementation.
+> 
+> Yes... and requires additional memory/disk space: we keep old records
+> in data files and we'll store them again...
+> 
+> How about: use overwriting smgr + put old records into rollback
+> segments - RS - (you have to keep them somewhere till TX's running
+> anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> changes and WAL will be used for RS/data files recovery).
+> Something like what Oracle does.
+
+I am sorry.  I see what you are saying now.  I missed the words
+"overwriting smgr".  You are suggesting going to an overwriting storage
+manager.  Is this to be done only because of savepoints.  Doesn't seem
+worth it when I have a possible solution without such a drastic change.
+Also, overwriting storage manager will require MVCC to read through
+there to get accurate MVCC visibility, right?
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
+
+From vmikheev@SECTORBASE.COM Wed Jan 23 21:03:29 2002
+Return-path: <vmikheev@SECTORBASE.COM>
+Received: from sectorbase2.sectorbase.com ([66.106.163.120])
+       by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g0O23TU28813
+       for <pgman@candle.pha.pa.us>; Wed, 23 Jan 2002 21:03:29 -0500 (EST)
+Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
+       id <DKXVZFBY>; Wed, 23 Jan 2002 18:03:18 -0800
+Message-ID: <3705826352029646A3E91C53F7189E32518487@sectorbase2.sectorbase.com>
+From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+To: "'Bruce Momjian'" <pgman@candle.pha.pa.us>
+cc: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
+Subject: RE: [HACKERS] Savepoints
+Date: Wed, 23 Jan 2002 18:03:11 -0800
+MIME-Version: 1.0
+X-Mailer: Internet Mail Service (5.5.2653.19)
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Status: ORr
+
+> > How about: use overwriting smgr + put old records into rollback
+> > segments - RS - (you have to keep them somewhere till TX's running
+> > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> > changes and WAL will be used for RS/data files recovery).
+> > Something like what Oracle does.
+> 
+> I am sorry. I see what you are saying now. I missed the words
+
+And I'm sorry for missing your notes about storing relid+tid only.
+
+> "overwriting smgr". You are suggesting going to an overwriting
+> storage manager. Is this to be done only because of savepoints.
+
+No. One point I made a few months ago (and never got objections)
+is - why keep old data in data files sooooo long?
+Imagine long running TX (eg pg_dump). Why other TX-s must read
+again and again completely useless (for them) old data we keep
+for pg_dump?
+
+> Doesn't seem worth it when I have a possible solution without
+> such a drastic change.
+> Also, overwriting storage manager will require MVCC to read
+> through there to get accurate MVCC visibility, right?
+
+Right... just like now non-overwriting smgr requires *ALL*
+TX-s to read old data in data files. But with overwriting smgr
+TX will read RS only when it is required and as far (much) as
+it is required.
+
+Simple solutions are not always the best ones.
+Compare Oracle and InterBase. Both have MVCC.
+Smgr-s are different. What RDBMS is more cool?
+Why doesn't Oracle use more simple non-overwriting smgr
+(as InterBase... and we do)?
+
+Vadim
+
+From dhogaza@pacifier.com Wed Jan 23 21:05:37 2002
+Return-path: <dhogaza@pacifier.com>
+Received: from comet.pacifier.com ([199.2.117.155])
+       by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g0O25bU28962
+       for <pgman@candle.pha.pa.us>; Wed, 23 Jan 2002 21:05:37 -0500 (EST)
+Received: from pacifier.com (dsl-dhogaza.pacifier.net [207.202.226.68])
+       by comet.pacifier.com (8.11.2/8.11.1) with ESMTP id g0O24qX29917;
+       Wed, 23 Jan 2002 18:04:52 -0800 (PST)
+Message-ID: <3C4F6BF0.2010406@pacifier.com>
+Date: Wed, 23 Jan 2002 18:05:36 -0800
+From: Don Baccus <dhogaza@pacifier.com>
+User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.7) Gecko/20011221
+X-Accept-Language: en-us
+MIME-Version: 1.0
+To: Bruce Momjian <pgman@candle.pha.pa.us>
+cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
+   PostgreSQL-development <pgsql-hackers@postgresql.org>
+Subject: Re: [HACKERS] Savepoints
+References: <200201240120.g0O1Kwm26748@candle.pha.pa.us>
+Content-Type: text/plain; charset=us-ascii; format=flowed
+Content-Transfer-Encoding: 7bit
+Status: OR
+
+Bruce Momjian wrote:
+
+
+> I am sorry.  I see what you are saying now.  I missed the words
+> "overwriting smgr".  You are suggesting going to an overwriting storage
+> manager.
+
+
+Overwriting storage managers don't suffer from unbounded growth of 
+datafiles until garbage collection (vacuum) is performed.  In fact, 
+there's no need for a vacuum-style utility.  The rollback segments only 
+need to keep around enough past history to rollback transactions that 
+are executing.
+
+Of course, then the size of your transactions is limited by the size of 
+your rollback segments, which in Oracle are fixed in length when you 
+build your database (there are ways to change this when you figure out 
+that you didn't pick a good number when creating it).
+
+> Is this to be done only because of savepoints.
+
+Not in traditional storage managers such as Oracle uses.  The complexity 
+of managing visibility and the like is traded off against the fact that 
+you're not stuck ever needing to garbage collect a database that 
+occupies a roomful of disks.
+
+It's a trade-off.  PG's current storage manager seems to work awfully 
+well in a lot of common database scenarios, and Tom's new vacuum is 
+meant to help mitigate against the drawbacks.   But overwriting storage 
+managers certainly have their advantages, too.
+
+> Doesn't seem
+> worth it when I have a possible solution without such a drastic change.
+> Also, overwriting storage manager will require MVCC to read through
+> there to get accurate MVCC visibility, right?
+
+
+Yep...
+
+-- 
+Don Baccus
+Portland, OR
+http://donb.photo.net, http://birdnotes.net, http://openacs.org
+
+
+From Inoue@tpf.co.jp Thu Jan 24 11:34:48 2002
+Return-path: <Inoue@tpf.co.jp>
+Received: from p2272.nsk.ne.jp (p2272.nsk.ne.jp [210.145.18.145])
+       by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g0OGYjU23980
+       for <pgman@candle.pha.pa.us>; Thu, 24 Jan 2002 11:34:47 -0500 (EST)
+Received: from mcadnote1 (ppm132.noc.fukui.nsk.ne.jp [61.198.95.32])
+       by p2272.nsk.ne.jp (8.9.3/3.7W-20000722) with SMTP id BAA12147;
+       Fri, 25 Jan 2002 01:34:24 +0900 (JST)
+From: "Hiroshi Inoue" <Inoue@tpf.co.jp>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+cc: "PostgreSQL-development" <pgsql-hackers@postgreSQL.org>,
+   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
+Subject: RE: [HACKERS] Savepoints
+Date: Fri, 25 Jan 2002 01:34:29 +0900
+Message-ID: <EKEJJICOHDIEMGPNIFIJKEFBGJAA.Inoue@tpf.co.jp>
+MIME-Version: 1.0
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Content-Transfer-Encoding: 7bit
+X-Priority: 3 (Normal)
+X-MSMail-Priority: Normal
+X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
+In-Reply-To: <3705826352029646A3E91C53F7189E32518483@sectorbase2.sectorbase.com>
+X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200
+Importance: Normal
+Status: OR
+
+> -----Original Message-----
+> From: Mikheev, Vadim
+> 
+> How about: use overwriting smgr + put old records into rollback
+> segments - RS - (you have to keep them somewhere till TX's running
+> anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> changes and WAL will be used for RS/data files recovery).
+> Something like what Oracle does.
+
+As long as we use no overwriting manager
+1) Rollback(data) isn't needed in case of a db crash.
+2) Rollback(data) isn't needed to cancel a transaction entirely.
+3) We don't need to mind the transaction size so much.
+
+We can't use the db any longer if a REDO recovery fails now.
+Under overwriting smgr we can't use the db any longer either
+if rollback fails. How could PG be not less reliable than now ?
+
+regards,
+Hiroshi Inoue
+
+From pgsql-hackers-owner+M18123=candle.pha.pa.us=pgman@postgresql.org Thu Jan 24 14:15:11 2002
+Return-path: <pgsql-hackers-owner+M18123=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0OJFAU12547
+       for <pgman@candle.pha.pa.us>; Thu, 24 Jan 2002 14:15:10 -0500 (EST)
+Received: (qmail 43413 invoked by alias); 24 Jan 2002 19:13:48 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 24 Jan 2002 19:13:48 -0000
+Received: from sectorbase2.sectorbase.com ([66.106.163.120])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0OJC4l42011
+       for <pgsql-hackers@postgreSQL.org>; Thu, 24 Jan 2002 14:12:04 -0500 (EST)
+       (envelope-from vmikheev@SECTORBASE.COM)
+Received: by sectorbase2.sectorbase.com with Internet Mail Service (5.5.2653.19)
+       id <DKXVZF9P>; Thu, 24 Jan 2002 11:11:54 -0800
+Message-ID: <3705826352029646A3E91C53F7189E3251848B@sectorbase2.sectorbase.com>
+From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+To: "'Hiroshi Inoue'" <Inoue@tpf.co.jp>
+cc: PostgreSQL-development <pgsql-hackers@postgresql.org>,
+   "'Bruce Momjian'"
+  <pgman@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+Date: Thu, 24 Jan 2002 11:11:52 -0800
+MIME-Version: 1.0
+X-Mailer: Internet Mail Service (5.5.2653.19)
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+> > How about: use overwriting smgr + put old records into rollback
+> > segments - RS - (you have to keep them somewhere till TX's running
+> > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> > changes and WAL will be used for RS/data files recovery).
+> > Something like what Oracle does.
+> 
+> As long as we use no overwriting manager
+> 1) Rollback(data) isn't needed in case of a db crash.
+> 2) Rollback(data) isn't needed to cancel a transaction entirely.
+
+-1) But vacuum must read a huge amount of data to remove dirt.
+-2) But TX-s must read data they are not interested in at all.
+
+> 3) We don't need to mind the transaction size so much.
+
+-3) The same with overwriting smgr and WAL used *only as REDO log*:
+we are not required to keep WAL files for duration of transaction
+- as soon as server knows that changes logged in some WAL file
+applied to data files and RS on disk (and archived, for WAL-based
+BAR) that file may be reused/removed. Old data will still occupy
+space in RS but their space in data files will be available
+for reuse.
+
+> We can't use the db any longer if a REDO recovery fails now.
+
+Reset WAL and use/dump it. Annoying? Agreed. Fix bugs and/or
+use good RAM - whatever caused problem with restart.
+
+> Under overwriting smgr we can't use the db any longer either
+> if rollback fails.
+
+Why should it fail? Bugs? Fix them.
+
+> How could PG be not less reliable than now ?
+
+Is today's PG more reliable than Oracle, Informix, DB2?
+
+Vadim
+
+---------------------------(end of broadcast)---------------------------
+TIP 3: if posting/reading through Usenet, please send an appropriate
+subscribe-nomail command to majordomo@postgresql.org so that your
+message can get through to the mailing list cleanly
+
+From pgsql-hackers-owner+M18125=candle.pha.pa.us=pgman@postgresql.org Thu Jan 24 14:23:42 2002
+Return-path: <pgsql-hackers-owner+M18125=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0OJNfU13481
+       for <pgman@candle.pha.pa.us>; Thu, 24 Jan 2002 14:23:42 -0500 (EST)
+Received: (qmail 49604 invoked by alias); 24 Jan 2002 19:23:40 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 24 Jan 2002 19:23:40 -0000
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0OJMTl48885
+       for <pgsql-hackers@postgreSQL.org>; Thu, 24 Jan 2002 14:22:29 -0500 (EST)
+       (envelope-from pgman@candle.pha.pa.us)
+Received: (from pgman@localhost)
+       by candle.pha.pa.us (8.11.6/8.10.1) id g0OJMJf13378;
+       Thu, 24 Jan 2002 14:22:19 -0500 (EST)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200201241922.g0OJMJf13378@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+In-Reply-To: <3705826352029646A3E91C53F7189E32518487@sectorbase2.sectorbase.com>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+Date: Thu, 24 Jan 2002 14:22:19 -0500 (EST)
+cc: PostgreSQL-development <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL96 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+
+OK, I have had time to think about this, and I think I can put the two
+proposals into perspective.  I will use Vadim's terminology.
+
+In our current setup, rollback/undo data is kept in the same file as our
+live data.  This data is used for two purposes, one, for rollback of
+transactions, and perhaps subtransactions in the future, and second, for
+MVCC visibility for backends making changes.
+
+So, it seems the real question is whether a database modification should
+write the old data into a separate rollback segment and modify the heap
+data, or just create a new row and require the old row to be removed
+later by vacuum.
+
+Let's look at this behavior without MVCC.  In such cases, if someone
+tries to read a modified row, it will block and wait for the modifying
+backend to commit or rollback, when it will then continue.  In such
+cases, there is no reason for the waiting transaction to read the old
+data in the redo segment because it can't continue anyway.
+
+Now, with MVCC, the backend has to read through the redo segment to get
+the original data value for that row.
+
+Now, while rollback segments do help with cleaning out old UPDATE rows,
+how does it improve DELETE performance?  Seems it would just mark it as
+expired like we do now.
+
+One objection I always had to redo segments was that if I start a
+transaction in the morning and walk away, none of the redo segments can
+be recycled.  I was going to ask if we can force some type of redo
+segment compaction to keep old active rows and delete rows no longer
+visible to any transaction.  However, I now realize that our VACUUM has
+the same problem.  Tuples with XID >= GetOldestXmin() are not recycled,
+meaning we have this problem in our current implementation too.  (I
+wonder if our vacuum could be smarter about knowing which rows are
+visible, perhaps by creating a sorted list of xid's and doing a binary
+search on the list to determine visibility.)
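
A rough sketch of the sorted-list idea floated above, under loud assumptions: Xid is a plain 32-bit stand-in that ignores wraparound, and xid_list_sort and xid_is_running are hypothetical names, not PostgreSQL functions. It shows only the membership test (is this xid among the running transactions?). A real visibility decision would also need each transaction's snapshot, which this does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t Xid; /* stand-in type; wraparound is deliberately ignored */

/* Three-way comparison for qsort/bsearch; avoids overflow from subtraction. */
static int xid_cmp(const void *a, const void *b)
{
    Xid x = *(const Xid *)a;
    Xid y = *(const Xid *)b;
    return (x > y) - (x < y);
}

/* Sort the list of running xids once, up front. */
static void xid_list_sort(Xid *xids, size_t n)
{
    assert(xids != NULL || n == 0);
    if (n > 0)
        qsort(xids, n, sizeof(Xid), xid_cmp);
}

/* Binary search: O(log n) per tuple instead of a linear scan of the list. */
static bool xid_is_running(const Xid *sorted, size_t n, Xid xid)
{
    if (n == 0)
        return false;
    return bsearch(&xid, sorted, n, sizeof(Xid), xid_cmp) != NULL;
}
```

The list is sorted once per vacuum pass and then probed once per tuple, so the sort cost is amortized over every row examined.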
+
+So, I guess the issue is, do we want to keep redo information in the
+main table, or split it out into redo segments.  Certainly we have to
+eliminate the Oracle restrictions that redo segment size is fixed at
+install time.
+
+The advantage of a redo segment is that hopefully we don't have
+transactions reading through irrelevant undo information.  The
+disadvantage is that we now have redo information grouped into table
+files where a sequential scan can be performed.  (Index scans of redo
+info are a performance problem currently.)  We would have to somehow
+efficiently access redo information grouped into the redo segments. 
+Perhaps a hash based on relid would help here.  Another disadvantage is
+concurrency.  When we start modifying heap data in place, we have to
+prevent other backends from seeing that modification while we move the
+old data to the redo segment.
+
+I guess my feeling is that if we can get vacuum to happen automatically,
+how is our current non-overwriting storage manager different from redo
+segments?
+
+One big advantage of redo segments would be that right now, if someone
+updates a row repeatedly, there are lots of heap versions of the row
+that are difficult to shrink in the table, while if they are in the redo
+segments, we can more efficiently remove them, and there is only one heap
+row.
+
+How is recovery handled with rollback segments?  Do we write old and new
+data to WAL?   We just write new data to WAL now, right?  Do we fsync
+rollback segments?
+
+Have I outlined this accurately?
+
+---------------------------------------------------------------------------
+
+Mikheev, Vadim wrote:
+> > > How about: use overwriting smgr + put old records into rollback
+> > > segments - RS - (you have to keep them somewhere till TX's running
+> > > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> > > changes and WAL will be used for RS/data files recovery).
+> > > Something like what Oracle does.
+> > 
+> > I am sorry. I see what you are saying now. I missed the words
+> 
+> And I'm sorry for missing your notes about storing relid+tid only.
+> 
+> > "overwriting smgr". You are suggesting going to an overwriting
+> > storage manager. Is this to be done only because of savepoints.
+> 
+> No. One point I made a few months ago (and never got objections)
+> is - why keep old data in data files sooooo long?
+> Imagine long running TX (eg pg_dump). Why other TX-s must read
+> again and again completely useless (for them) old data we keep
+> for pg_dump?
+> 
+> > Doesn't seem worth it when I have a possible solution without
+> > such a drastic change.
+> > Also, overwriting storage manager will require MVCC to read
+> > through there to get accurate MVCC visibility, right?
+> 
+> Right... just like now non-overwriting smgr requires *ALL*
+> TX-s to read old data in data files. But with overwriting smgr
+> TX will read RS only when it is required and as far (much) as
+> it is required.
+> 
+> Simple solutions are not always the best ones.
+> Compare Oracle and InterBase. Both have MVCC.
+> Smgr-s are different. What RDBMS is more cool?
+> Why doesn't Oracle use more simple non-overwriting smgr
+> (as InterBase... and we do)?
+> 
+> Vadim
+> 
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+---------------------------(end of broadcast)---------------------------
+TIP 2: you can get off all lists at once with the unregister command
+    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
+
+From pgsql-hackers-owner+M18141=candle.pha.pa.us=pgman@postgresql.org Thu Jan 24 19:43:38 2002
+Return-path: <pgsql-hackers-owner+M18141=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0P0hbU15026
+       for <pgman@candle.pha.pa.us>; Thu, 24 Jan 2002 19:43:38 -0500 (EST)
+Received: (qmail 28642 invoked by alias); 25 Jan 2002 00:43:24 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 25 Jan 2002 00:43:24 -0000
+Received: from sd.tpf.co.jp (sd.tpf.co.jp [210.161.239.34])
+       by postgresql.org (8.11.3/8.11.4) with SMTP id g0P0YIl27208
+       for <pgsql-hackers@postgreSQL.org>; Thu, 24 Jan 2002 19:34:18 -0500 (EST)
+       (envelope-from Inoue@tpf.co.jp)
+Received: (qmail 3661 invoked from network); 25 Jan 2002 00:34:19 -0000
+Received: from unknown (HELO viscomail.tpf.co.jp) (100.0.0.108)
+  by sd2.tpf-fw-c.co.jp with SMTP; 25 Jan 2002 00:34:19 -0000
+Received: from tpf.co.jp (3dgateway1 [126.0.1.60])
+       by viscomail.tpf.co.jp (8.8.8+Sun/8.8.8) with ESMTP id JAA00756;
+       Fri, 25 Jan 2002 09:34:18 +0900 (JST)
+Message-ID: <3C50A807.32A29E09@tpf.co.jp>
+Date: Fri, 25 Jan 2002 09:34:15 +0900
+From: Hiroshi Inoue <Inoue@tpf.co.jp>
+X-Mailer: Mozilla 4.73 [ja] (Windows NT 5.0; U)
+X-Accept-Language: ja
+MIME-Version: 1.0
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+cc: PostgreSQL-development <pgsql-hackers@postgresql.org>,
+   "'Bruce Momjian'" <pgman@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+References: <3705826352029646A3E91C53F7189E3251848B@sectorbase2.sectorbase.com>
+Content-Type: text/plain; charset=iso-2022-jp
+Content-Transfer-Encoding: 7bit
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+"Mikheev, Vadim" wrote:
+> 
+> > > How about: use overwriting smgr + put old records into rollback
+> > > segments - RS - (you have to keep them somewhere till TX's running
+> > > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> > > changes and WAL will be used for RS/data files recovery).
+> > > Something like what Oracle does.
+> >
+> > As long as we use no overwriting manager
+> > 1) Rollback(data) isn't needed in case of a db crash.
+> > 2) Rollback(data) isn't needed to cancel a transaction entirely.
+> 
+> -1) But vacuum must read a huge amount of data to remove dirt.
+> -2) But TX-s must read data they are not interested in at all.
+> 
+> > 3) We don't need to mind the transaction size so much.
+> 
+> -3) The same with overwriting smgr and WAL used *only as REDO log*:
+
+The larger RS becomes the longer it would take time to cancel
+the transaction whereas it is executed in a moment under no
+overwriting smgr and for example if RS exhausted all disk space
+is PG really safe ? Other backends would also fail because they
+couldn't write RS any more. Many transactions would execute
+UNDO operations simultaneously but there's no space to write
+WALs (UNDO operations must be written to WAL also) and PG
+system would abort. And could PG restart under such situations ?
+Even though there's a way to recover from the situation, I
+think we should avoid such dangerous situations from the
+first. Basically recovery operations should never fail.
+ 
+> 
+> > We can't use the db any longer if a REDO recovery fails now.
+> 
+> Reset WAL and use/dump it. Annoying? Agreed. Fix bugs and/or
+> use good RAM - whatever caused problem with restart.
+
+As I already mentioned recovery operations should never fail.
+> 
+> > Under overwriting smgr we can't use the db any longer either
+> > if rollback fails.
+> 
+> Why should it fail? Bugs? Fix them.
+
+Rollback operations are executed much more often than
+REDO recovery and it is hard to fix such bugs once PG
+was released. Most people in such troubles have no
+time to pursue the cause. In reality I replied to the
+PG restart troubles twice (with --wal-debug and pg_resetxlog
+suggestions ) in Japan but got no further replies.
+
+> 
+> > How could PG be not less reliable than now ?
+> 
+> Is today's PG more reliable than Oracle, Informix, DB2?
+
+I have never been and would never be optimistic
+about recovery. Is 7.1 more reliable than 7.0 from the
+recovery POV ? I see no reason why overwriting smgr is
+more reliable than no overwriting smgr as for recovery.
+
+regards,
+Hiroshi Inoue
+
+---------------------------(end of broadcast)---------------------------
+TIP 6: Have you searched our list archives?
+
+http://archives.postgresql.org
+
+From ZeugswetterA@spardat.at Fri Jan 25 09:21:40 2002
+Return-path: <ZeugswetterA@spardat.at>
+Received: from smxsat1.smxs.net (smxsat1.smxs.net [213.150.10.1])
+       by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g0PELde10640
+       for <pgman@candle.pha.pa.us>; Fri, 25 Jan 2002 09:21:39 -0500 (EST)
+Received: from m01x1.s-mxs.net [10.3.55.201]
+       by smxsat1.smxs.net
+       with XWall v3.18f ;
+       Fri, 25 Jan 2002 15:22:51 +0100
+Received: from m0103.s-mxs.net [10.3.55.3]
+       by m01x1.s-mxs.net
+       with XWall v3.18a ;
+       Fri, 25 Jan 2002 15:21:23 +0100
+Received: from m0114.s-mxs.net ([10.3.55.14]) by m0103.s-mxs.net with Microsoft SMTPSVC(5.0.2195.2966);
+  Fri, 25 Jan 2002 15:21:22 +0100
+X-MimeOLE: Produced By Microsoft Exchange V6.0.5762.3
+content-class: urn:content-classes:message
+MIME-Version: 1.0
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Subject: RE: [HACKERS] Savepoints
+Date: Fri, 25 Jan 2002 15:21:22 +0100
+Message-ID: <46C15C39FEB2C44BA555E356FBCD6FA42128DE@m0114.s-mxs.net>
+Thread-Topic: [HACKERS] Savepoints
+Thread-Index: AcGkZ8SMKn//UUTjS3mi+qC7+gZAwwBQ4YMA
+From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
+   "Bruce Momjian" <pgman@candle.pha.pa.us>,
+   "PostgreSQL-development" <pgsql-hackers@postgresql.org>
+X-OriginalArrivalTime: 25 Jan 2002 14:21:22.0648 (UTC) FILETIME=[9090BD80:01C1A5AB]
+Content-Transfer-Encoding: 8bit
+X-MIME-Autoconverted: from quoted-printable to 8bit by candle.pha.pa.us id g0PELde10640
+Status: OR
+
+Vadim wrote:
+> How about: use overwriting smgr + put old records into rollback
+> segments - RS - (you have to keep them somewhere till TX's running
+> anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> changes and WAL will be used for RS/data files recovery).
+> Something like what Oracle does.
+
+We have all the info we need in WAL and in the old rows,
+why would you want to write them to RS ?
+You only need RS for overwriting smgr.
+
+Andreas
+
+From pgsql-hackers-owner+M18209=candle.pha.pa.us=pgman@postgresql.org Fri Jan 25 16:14:02 2002
+Return-path: <pgsql-hackers-owner+M18209=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0PLE1e19182
+       for <pgman@candle.pha.pa.us>; Fri, 25 Jan 2002 16:14:01 -0500 (EST)
+Received: (qmail 85111 invoked by alias); 25 Jan 2002 21:13:59 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 25 Jan 2002 21:13:59 -0000
+Received: from smxsat1.smxs.net (smxsat1.smxs.net [213.150.10.1])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0PL48l79366
+       for <pgsql-hackers@postgresql.org>; Fri, 25 Jan 2002 16:04:09 -0500 (EST)
+       (envelope-from ZeugswetterA@spardat.at)
+Received: from m01x1.s-mxs.net [10.3.55.201]
+       by smxsat1.smxs.net
+       with XWall v3.18f ;
+       Fri, 25 Jan 2002 22:05:21 +0100
+Received: from m0102.s-mxs.net [10.3.55.2]
+       by m01x1.s-mxs.net
+       with XWall v3.18a ;
+       Fri, 25 Jan 2002 22:03:54 +0100
+Received: from m0114.s-mxs.net ([10.3.55.14]) by m0102.s-mxs.net with Microsoft SMTPSVC(5.0.2195.2966);
+  Fri, 25 Jan 2002 22:03:53 +0100
+X-MimeOLE: Produced By Microsoft Exchange V6.0.5762.3
+content-class: urn:content-classes:message
+MIME-Version: 1.0
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Subject: Re: [HACKERS] Savepoints
+Date: Fri, 25 Jan 2002 22:03:53 +0100
+Message-ID: <46C15C39FEB2C44BA555E356FBCD6FA41EB4C4@m0114.s-mxs.net>
+Thread-Topic: [HACKERS] Savepoints
+Thread-Index: AcGlDMGVwSWndt4kT1C7QhclLvQPWgA1arbw
+From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
+To: "Bruce Momjian" <pgman@candle.pha.pa.us>,
+   "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+cc: "PostgreSQL-development" <pgsql-hackers@postgresql.org>
+X-OriginalArrivalTime: 25 Jan 2002 21:03:53.0685 (UTC) FILETIME=[CBB48850:01C1A5E3]
+Content-Transfer-Encoding: 8bit
+X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id g0PLDAm83732
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: ORr
+
+
+> Now, with MVCC, the backend has to read through the redo segment to get
+
+You mean rollback segment, but ...
+
+> the original data value for that row.
+
+Will only need to be looked up if the row is currently being modified by 
+a not yet committed txn (at least in the default read committed mode)  
+
+> 
+> Now, while rollback segments do help with cleaning out old UPDATE rows,
+> how does it improve DELETE performance?  Seems it would just mark it as
+> expired like we do now.
+
+delete would probably be: 
+1. mark original deleted and write whole row to RS
+
+I don't think you would like to mix looking up deleted rows in heap
+but updated rows in RS
+
+Andreas
+
+PS: not that I like overwrite with MVCC now
+If you think of VACUUM as garbage collection PG is highly trendy with
+the non-overwriting smgr.
+
+---------------------------(end of broadcast)---------------------------
+TIP 5: Have you checked our extensive FAQ?
+
+http://www.postgresql.org/users-lounge/docs/faq.html
+
+From pgsql-hackers-owner+M18211=candle.pha.pa.us=pgman@postgresql.org Fri Jan 25 16:53:45 2002
+Return-path: <pgsql-hackers-owner+M18211=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0PLrie22174
+       for <pgman@candle.pha.pa.us>; Fri, 25 Jan 2002 16:53:44 -0500 (EST)
+Received: (qmail 96831 invoked by alias); 25 Jan 2002 21:53:43 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 25 Jan 2002 21:53:43 -0000
+Received: from smxsat1.smxs.net (smxsat1.smxs.net [213.150.10.1])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0PLpRl96298
+       for <pgsql-hackers@postgresql.org>; Fri, 25 Jan 2002 16:51:27 -0500 (EST)
+       (envelope-from ZeugswetterA@spardat.at)
+Received: from m01x1.s-mxs.net [10.3.55.201]
+       by smxsat1.smxs.net
+       with XWall v3.18f ;
+       Fri, 25 Jan 2002 22:52:54 +0100
+Received: from m0103.s-mxs.net [10.3.55.3]
+       by m01x1.s-mxs.net
+       with XWall v3.18a ;
+       Fri, 25 Jan 2002 22:51:25 +0100
+Received: from m0114.s-mxs.net ([10.3.55.14]) by m0103.s-mxs.net with Microsoft SMTPSVC(5.0.2195.2966);
+  Fri, 25 Jan 2002 22:51:25 +0100
+X-MimeOLE: Produced By Microsoft Exchange V6.0.5762.3
+content-class: urn:content-classes:message
+MIME-Version: 1.0
+Content-Type: text/plain;
+       charset="iso-8859-1"
+Subject: Re: [HACKERS] Savepoints
+Date: Fri, 25 Jan 2002 22:51:24 +0100
+Message-ID: <46C15C39FEB2C44BA555E356FBCD6FA41EB4C5@m0114.s-mxs.net>
+Thread-Topic: [HACKERS] Savepoints
+Thread-Index: AcGlznYKFcqoYpMnSlGQHhQuEf6LuAAGpxnQ
+From: "Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at>
+To: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
+cc: <pgsql-hackers@postgresql.org>
+X-OriginalArrivalTime: 25 Jan 2002 21:51:25.0008 (UTC) FILETIME=[6F39E500:01C1A5EA]
+Content-Transfer-Encoding: 8bit
+X-MIME-Autoconverted: from quoted-printable to 8bit by postgresql.org id g0PLrP196418
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+
+> > > How about: use overwriting smgr + put old records into rollback
+> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+> > > segments - RS - (you have to keep them somewhere till TX's running
+> > > anyway) + use WAL only as REDO log (RS will be used to rollback TX'
+> > > changes and WAL will be used for RS/data files recovery).
+> > > Something like what Oracle does.
+> > 
+> > We have all the info we need in WAL and in the old rows,
+> > why would you want to write them to RS ?
+> > You only need RS for overwriting smgr.
+> 
+> This is what I'm saying - implement Overwriting smgr...
+
+Yes I am sorry, I am catching up on email and had not read Bruce's 
+comment (nor yours correctly) :-(
+
+I was also long in the pro overwriting camp, because I am used to 
+non MVCC dbs like DB/2 and Informix. (which I like very much) 
+But I am starting to doubt that overwriting is really so good for
+an MVCC db. And I don't think PG wants to switch to non MVCC :-)
+
+Imho it would only need a much more aggressive VACUUM backend.
+(aka garbage collector :-) Maybe it could be designed to sniff the 
+redo log (buffer) to get a hint at what to actually clean out next.
+
+Andreas
+
+---------------------------(end of broadcast)---------------------------
+TIP 4: Don't 'kill -9' the postmaster
+
+From pgsql-hackers-owner+M18218=candle.pha.pa.us=pgman@postgresql.org Fri Jan 25 19:14:24 2002
+Return-path: <pgsql-hackers-owner+M18218=candle.pha.pa.us=pgman@postgresql.org>
+Received: from server1.pgsql.org (www.postgresql.org [64.49.215.9])
+       by candle.pha.pa.us (8.11.6/8.10.1) with SMTP id g0Q0ENe03543
+       for <pgman@candle.pha.pa.us>; Fri, 25 Jan 2002 19:14:23 -0500 (EST)
+Received: (qmail 22482 invoked by alias); 26 Jan 2002 00:13:55 -0000
+Received: from unknown (HELO postgresql.org) (64.49.215.8)
+  by www.postgresql.org with SMTP; 26 Jan 2002 00:13:55 -0000
+Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35])
+       by postgresql.org (8.11.3/8.11.4) with ESMTP id g0PNw1l20714
+       for <pgsql-hackers@postgresql.org>; Fri, 25 Jan 2002 18:58:01 -0500 (EST)
+       (envelope-from pgman@candle.pha.pa.us)
+Received: (from pgman@localhost)
+       by candle.pha.pa.us (8.11.6/8.10.1) id g0PNvoL02515;
+       Fri, 25 Jan 2002 18:57:50 -0500 (EST)
+From: Bruce Momjian <pgman@candle.pha.pa.us>
+Message-ID: <200201252357.g0PNvoL02515@candle.pha.pa.us>
+Subject: Re: [HACKERS] Savepoints
+In-Reply-To: <46C15C39FEB2C44BA555E356FBCD6FA41EB4C4@m0114.s-mxs.net>
+To: Zeugswetter Andreas SB SD <ZeugswetterA@spardat.at>
+Date: Fri, 25 Jan 2002 18:57:50 -0500 (EST)
+cc: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>,
+   PostgreSQL-development <pgsql-hackers@postgresql.org>
+X-Mailer: ELM [version 2.4ME+ PL96 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Precedence: bulk
+Sender: pgsql-hackers-owner@postgresql.org
+Status: OR
+
+Zeugswetter Andreas SB SD wrote:
+> 
+> > Now, with MVCC, the backend has to read through the redo segment to get
+> 
+> You mean rollback segment, but ...
+
+
+Sorry, yes.  I get redo/undo/rollback mixed up sometimes. :-)
+
+> > the original data value for that row.
+> 
+> Will only need to be looked up if the row is currently being modified by 
+> a not yet committed txn (at least in the default read committed mode)  
+
+Uh, not really.  The transaction may have completed after my transaction
+started, meaning even though it looks like it is committed, to me, it is
+not visible.  Most MVCC visibility will require undo lookup.
+
+> 
+> > 
+> > Now, while rollback segments do help with cleaning out old UPDATE rows,
+> > how does it improve DELETE performance?  Seems it would just mark it as
+> > expired like we do now.
+> 
+> delete would probably be: 
+> 1. mark original deleted and write whole row to RS
+> 
+> I don't think you would like to mix looking up deleted rows in heap
+> but updated rows in RS
+
+Yes, so really the overwriting is only a big win for UPDATE.  Right now,
+UPDATE is DELETE/INSERT, and that DELETE makes MVCC happy. :-)
+
+My whole goal was to simplify this so we can see the differences.
+
+
+> PS: not that I like overwrite with MVCC now
+> If you think of VACUUM as garbage collection PG is highly trendy with
+> the non-overwriting smgr.
+
+Yes, that is basically what it is now, a garbage collector that collects
+in heap rather than in undo.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
+---------------------------(end of broadcast)---------------------------
+TIP 3: if posting/reading through Usenet, please send an appropriate
+subscribe-nomail command to majordomo@postgresql.org so that your
+message can get through to the mailing list cleanly
+
+From pgman Wed Jan 23 10:36:13 2002
+Subject: Savepoints
+To: PostgreSQL-development <pgsql-hackers@postgreSQL.org>
+Date: Wed, 23 Jan 2002 13:19:05 -0500 (EST)
+X-Mailer: ELM [version 2.4ME+ PL96 (25)]
+MIME-Version: 1.0
+Content-Transfer-Encoding: 7bit
+Content-Type: text/plain; charset=US-ASCII
+Content-Length:  1829
+Status: OR
+
+I have talked in the past about a possible implementation of
+savepoints/nested transactions.  I would like to more formally outline
+my ideas below.
+
+We have talked about using WAL for such a purpose, but that requires WAL
+files to remain for the life of a transaction, which seems unacceptable.
+Other database systems do that, and it is a pain for administrators.  I
+realized we could do some sort of WAL compaction, but that seems quite
+complex too.
+
+Basically, under my plan, WAL would be unchanged.  WAL's function is
+crash recovery, and it would retain that.  There would also be no
+on-disk changes.  I would use the command counter in certain cases to
+identify savepoints.
+
+My idea is to keep savepoint undo information in a private area per
+backend, either in memory or on disk.  We can either save the
+relid/tids of modified rows, or if there are too many, discard the
+saved ones and just remember the modified relids.  On rollback to a
+savepoint, either clear up the modified relid/tids, or sequentially scan
+through the relid and clear up all the tuples that have our transaction
+id and have command counters that are part of the undo savepoint.
+
+It seems marking undo savepoint rows with a fixed aborted transaction id
+would be the easiest solution.
+
+Of course, we only remember modified rows when we are in savepoints, and
+only undo them when we roll back to a savepoint.  Transaction processing
+remains the same.
+
+There is no reason for other backends to be able to see savepoint undo
+information, and keeping it private greatly simplifies the
+implementation.
+
+-- 
+  Bruce Momjian                        |  http://candle.pha.pa.us
+  pgman@candle.pha.pa.us               |  (610) 853-3000
+  +  If your life is a hard drive,     |  830 Blythe Avenue
+  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
+
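The savepoint proposal in the last message above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not PostgreSQL code: the names `Backend`, `HeapTuple`, `Savepoint`, `ABORTED_XID`, and `MAX_TRACKED_TIDS` are hypothetical, only inserts are modeled, and rolling back discards the savepoint for brevity. It shows the two undo paths described in the message: an exact list of relid/tids while the list is small, and a fallback scan of the touched relations that filters on transaction id and command counter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

ABORTED_XID = -1        # hypothetical fixed "aborted" transaction id
MAX_TRACKED_TIDS = 4    # hypothetical limit before falling back to relids only


@dataclass
class HeapTuple:
    xmin: int  # id of the transaction that inserted the row
    cmin: int  # command counter value of the inserting command


@dataclass
class Savepoint:
    start_cmd: int  # command counter when the savepoint was taken
    # Exact (relid, tid) pairs; None once there were too many to keep.
    tids: Optional[Set[Tuple[int, int]]] = field(default_factory=set)
    relids: Set[int] = field(default_factory=set)


class Backend:
    """Private, per-backend savepoint undo information."""

    def __init__(self, xid: int, heap: Dict[int, Dict[int, HeapTuple]]):
        self.xid = xid
        self.heap = heap  # relid -> tid -> tuple (stands in for shared storage)
        self.command_counter = 0
        self.savepoints: List[Savepoint] = []

    def savepoint(self) -> None:
        self.savepoints.append(Savepoint(start_cmd=self.command_counter))

    def insert(self, relid: int, tid: int) -> None:
        self.heap.setdefault(relid, {})[tid] = HeapTuple(self.xid, self.command_counter)
        # Rows are remembered only while at least one savepoint is open.
        for sp in self.savepoints:
            sp.relids.add(relid)
            if sp.tids is not None:
                sp.tids.add((relid, tid))
                if len(sp.tids) > MAX_TRACKED_TIDS:
                    sp.tids = None  # too many: keep only the relids
        self.command_counter += 1

    def rollback_to_savepoint(self) -> None:
        sp = self.savepoints.pop()
        if sp.tids is not None:
            # Exact path: visit just the remembered rows.
            targets = [self.heap[r][t] for r, t in sp.tids]
        else:
            # Fallback path: scan each touched relation for our own rows
            # whose command counter falls inside the savepoint.
            targets = [
                tup
                for r in sp.relids
                for tup in self.heap[r].values()
                if tup.xmin == self.xid and tup.cmin >= sp.start_cmd
            ]
        for tup in targets:
            tup.xmin = ABORTED_XID  # mark as belonging to an aborted transaction
```

Nothing in the sketch touches WAL or any shared structure other than the tuples themselves, which is the point of keeping the undo list private to the backend.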