From: Bruce Momjian Date: Fri, 3 Jun 2005 21:46:02 +0000 (+0000) Subject: Put back WAL TODO.detail entries. X-Git-Tag: REL8_1_0BETA1~664 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=7e209f6ce3a4eb2cba877e5e76597ecdb1596c2e;p=postgresql Put back WAL TODO.detail entries. --- diff --git a/doc/TODO.detail/wal b/doc/TODO.detail/wal index 37d2edd432..244f4f7519 100644 --- a/doc/TODO.detail/wal +++ b/doc/TODO.detail/wal @@ -1,3 +1,84 @@ +From cjs@cynic.net Thu Jun 20 22:18:27 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195 + for ; Thu, 20 Jun 2002 22:18:26 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC) +Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST) +From: Curt Sampson +To: Bruce Momjian +cc: Michael Loftis , mlw , + PostgreSQL-development +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Thu, 20 Jun 2002, Bruce Momjian wrote: + +> > MS SQL Server has an interesting way of dealing with this. They have a +> > "torn" bit in each 512-byte chunk of a page, and this bit is set the +> > same for each chunk. When they are about to write out a page, they first +> > flip all of the torn bits and then do the write. If the write does not +> > complete due to a system crash or whatever, this can be detected later +> > because the torn bits won't match across the entire page. +> +> I was wondering, how does knowing the block is corrupt help MS SQL? + +I'm trying to recall, but I can't off hand. I'll have to look it +up in my Inside SQL Server book, which is at home right now, +unfortunately. I'll bring the book into work and let you know the +details later. + +> Right now, we write changed pages to WAL, then later write them to disk. + +Ah. You write the entire page? MS writes only the changed tuple. +And DB2, in fact, goes one better and writes only the part of the +tuple up to the change, IIRC. Thus, if you put smaller and/or more +frequently changed columns first, you'll have smaller logs. + +> I have always been looking for a way to prevent these WAL writes. The +> 512-byte bit seems interesting, but how does it help? + +Well, this would at least let you reduce the write to the 512-byte +chunk that changed, rather than writing the entire 8K page. + +> And how does the bit help them with partial block writes? Is the bit at +> the end of the block? Is that reliable? + +The bit is somewhere within every 512 byte "disk page" within the +8192 byte "filesystem/database page." So an 8KB page is divided up +like this: + + | <----------------------- 8 Kb ----------------------> | + + | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b | + +Thus, the tear bits start out like this: + + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + +After a successful write of the entire page, you have this: + + | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | + +If the write is unsuccessful, you end up with something like this: + + | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | + +And now you know which parts of your page got written, and which +parts didn't. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + From cjs@cynic.net Sat Jun 22 04:41:54 2002 Return-path: Received: from academic.cynic.net ([63.144.177.3]) @@ -17,7 +98,7 @@ In-Reply-To: <19332.1024668861@sss.pgh.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO +Status: OR On Fri, 21 Jun 2002, Tom Lane wrote: @@ -68,6 +149,681 @@ Curt Sampson +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC +From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014 + for ; Sat, 22 Jun 2002 18:31:20 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 47AD2475935 + for ; Sat, 22 Jun 2002 18:28:40 -0400 (EDT) +Received: from hades.usol.com (hades.usol.com [208.232.58.41]) + by postgresql.org (Postfix) with ESMTP id 1D5DA476166 + for ; Sat, 22 Jun 2002 18:23:16 -0400 (EDT) +Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344; + Sat, 22 Jun 2002 18:22:25 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> +References: <200206210158.g5L1wFk20118@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +Message-ID: <1024784514.1793.242.camel@localhost.localdomain> +MIME-Version: 1.0 +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 22 Jun 2002 18:22:58 -0400 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: ORr + +On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote: +> I was wondering, how does knowing the block is corrupt help MS SQL? +> Right now, we write changed pages to WAL, then later write them to disk. +> I have always been looking for a way to prevent these WAL writes. The +> 512-byte bit seems interesting, but how does it help? +> +> And how does the bit help them with partial block writes? Is the bit at +> the end of the block? Is that reliable? +> + +My understanding of this is as follows: + +1) On most commercial systems, if you get a corrupted block (from +partial write or whatever) you need to restore the file(s) from the most +recent backup, and replay the log from the log archive (usually only the +damaged files will be written to during replay). + +2) If you can't deal with the downtime to recover the file, then EMC, +Sun, or IBM will sell you an expensive disk array with an NVRAM cache +that will do atomic writes. Some plain-vanilla SCSI disks are also +capable of atomic writes, though usually they don't use NVRAM to do it. + +The database must then make sure that each page-write gets translated +into exactly one SCSI-level write. This is one reason why ORACLE and +Sybase recommend that you use raw disk partitions for high availability. +Some operating systems support this through the filesystem, but it is OS +dependent. I think Solaris 7 & 8 has support for this, but I'm not sure. + +PostgreSQL has trouble because it can neither archive logs for replay, +nor use raw disk partitions. + + +One other point: + +Page pre-image logging is fundamentally the same as what Jim Grey's +book[1] would call "careful writes". I don't believe they should be in +the XLOG, because we never need to keep the pre-images after we're sure +the buffer has made it to the disk. Instead, we should have the buffer +IO routines implement ping-pong writes of some kind if we want +protection from partial writes. + + +Does any of this make sense? + + + +;jrnield + + +[1] Grey, J. and Reuter, A. (1993). "Transaction Processing: Concepts + and Techniques". Morgan Kaufmann. + +-- +J. R. Nield +jrnield@usol.com + + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + +From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601 + for ; Sun, 23 Jun 2002 08:40:27 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 0D57847592A + for ; Sun, 23 Jun 2002 08:37:38 -0400 (EDT) +Received: from hades.usol.com (hades.usol.com [208.232.58.41]) + by postgresql.org (Postfix) with ESMTP id 75326475876 + for ; Sun, 23 Jun 2002 08:37:36 -0400 (EDT) +Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; + Sun, 23 Jun 2002 08:37:23 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> +References: <200206222317.g5MNHBn23427@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 08:37:53 -0400 +Message-ID: <1024835880.1793.264.camel@localhost.localdomain> +MIME-Version: 1.0 +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + +On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> J. R. Nield wrote: +> > One other point: +> > +> > Page pre-image logging is fundamentally the same as what Jim Grey's +> > book[1] would call "careful writes". I don't believe they should be in +> > the XLOG, because we never need to keep the pre-images after we're sure +> > the buffer has made it to the disk. Instead, we should have the buffer +> > IO routines implement ping-pong writes of some kind if we want +> > protection from partial writes. +> +> Ping-pong writes to where? We have to fsync, and rather than fsync that +> area and WAL, we just do WAL. Not sure about a win there. +> + +The key question is: do we have some method to ensure that the OS +doesn't do the writes in parallel? + +If the OS will ensure that one of the two block writes of a ping-pong +completes before the other starts, then we don't need to fsync() at +all. + +The only thing we are protecting against is the possibility of both +writes being partial. If neither is done, that's fine because WAL will +protect us. If the first write is partial, we will detect that and use +the old data from the other, then recover from WAL. If the first is +complete but the second is partial, then we detect that and use the +newer block from the first write. If the second is complete but the +first is partial, we detect that and use the newer block from the second +write. + +So does anyone know a way to prevent parallel writes in one of the +common unix standards? Do they say anything about this? + +It would seem to me that if the same process does both ping-pong writes, +then there should be a cheap way to enforce a serial order. I could be +wrong though. + +As to where the first block of the ping-pong should go, maybe we could +reserve a file with nBlocks space for them, and write the information +about which block was being written to the XLOG for use in recovery. +There are many other ways to do it. + +;jrnield + +-- +J. R. Nield +jrnield@usol.com + + + + +---------------------------(end of broadcast)--------------------------- +TIP 6: Have you searched our list archives? + +http://archives.postgresql.org + +From jrnield@usol.com Sun Jun 23 08:37:30 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCbRF28741 + for ; Sun, 23 Jun 2002 08:37:28 -0400 (EDT) +Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; + Sun, 23 Jun 2002 08:37:23 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Michael Loftis , + mlw + , + PostgreSQL Hacker , + Tom Lane +In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> +References: <200206222317.g5MNHBn23427@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 08:37:53 -0400 +Message-ID: <1024835880.1793.264.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: OR + +On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> J. R. Nield wrote: +> > One other point: +> > +> > Page pre-image logging is fundamentally the same as what Jim Grey's +> > book[1] would call "careful writes". I don't believe they should be in +> > the XLOG, because we never need to keep the pre-images after we're sure +> > the buffer has made it to the disk. Instead, we should have the buffer +> > IO routines implement ping-pong writes of some kind if we want +> > protection from partial writes. +> +> Ping-pong writes to where? We have to fsync, and rather than fsync that +> area and WAL, we just do WAL. Not sure about a win there. +> + +The key question is: do we have some method to ensure that the OS +doesn't do the writes in parallel? + +If the OS will ensure that one of the two block writes of a ping-pong +completes before the other starts, then we don't need to fsync() at +all. + +The only thing we are protecting against is the possibility of both +writes being partial. If neither is done, that's fine because WAL will +protect us. If the first write is partial, we will detect that and use +the old data from the other, then recover from WAL. If the first is +complete but the second is partial, then we detect that and use the +newer block from the first write. If the second is complete but the +first is partial, we detect that and use the newer block from the second +write. + +So does anyone know a way to prevent parallel writes in one of the +common unix standards? Do they say anything about this? + +It would seem to me that if the same process does both ping-pong writes, +then there should be a cheap way to enforce a serial order. I could be +wrong though. + +As to where the first block of the ping-pong should go, maybe we could +reserve a file with nBlocks space for them, and write the information +about which block was being written to the XLOG for use in recovery. +There are many other ways to do it. + +;jrnield + +-- +J. R. Nield +jrnield@usol.com + + + + +From cjs@cynic.net Sun Jun 23 09:33:29 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543 + for ; Sun, 23 Jun 2002 09:33:28 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC) +Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Bruce Momjian , Michael Loftis , + mlw , + PostgreSQL Hacker , + Tom Lane +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On 23 Jun 2002, J. R. Nield wrote: + +> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: +> > J. R. Nield wrote: +> > > One other point: +> > > +> > > Page pre-image logging is fundamentally the same as what Jim Grey's +> > > book[1] would call "careful writes". I don't believe they should be in +> > > the XLOG, because we never need to keep the pre-images after we're sure +> > > the buffer has made it to the disk. Instead, we should have the buffer +> > > IO routines implement ping-pong writes of some kind if we want +> > > protection from partial writes. +> > +> > Ping-pong writes to where? We have to fsync, and rather than fsync that +> > area and WAL, we just do WAL. Not sure about a win there. + +Presumably the win is that, "we never need to keep the pre-images +after we're sure the buffer has made it to the disk." So the +pre-image log can be completely ditched when we shut down the +server, so a full system sync, or whatever. This keeps the log file +size down, which means faster recovery, less to back up (when we +start getting transaction logs that can be backed up), etc. + +This should also allow us to disable completely the ping-pong writes +if we have a disk subsystem that we trust. (E.g., a disk array with +battery backed memory.) That would, in theory, produce a nice little +performance increase when lots of inserts and/or updates are being +committed, as we have much, much less to write to the log file. + +Are there stats that track, e.g., the bandwidth of writes to the +log file? I'd be interested in knowing just what kind of savings +one might see by doing this. + +> The key question is: do we have some method to ensure that the OS +> doesn't do the writes in parallel?... +> It would seem to me that if the same process does both ping-pong writes, +> then there should be a cheap way to enforce a serial order. I could be +> wrong though. + +Well, whether or not there's a cheap way depends on whether you consider +fsync to be cheap. :-) + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785 + for ; Sun, 23 Jun 2002 11:19:59 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT) +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id E2353475C4B + for ; Sun, 23 Jun 2002 11:19:47 -0400 (EDT) +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id 746F8475AEA + for ; Sun, 23 Jun 2002 11:19:46 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464; + Sun, 23 Jun 2002 11:19:15 -0400 (EDT) +To: Curt Sampson +cc: "J. R. Nield" , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +References: +Comments: In-reply-to Curt Sampson + message dated "Sun, 23 Jun 2002 22:33:07 +0900" +Date: Sun, 23 Jun 2002 11:19:15 -0400 +Message-ID: <8461.1024845555@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +Status: OR + +Curt Sampson writes: +> This should also allow us to disable completely the ping-pong writes +> if we have a disk subsystem that we trust. + +If we have a disk subsystem we trust, we just disable fsync on the +WAL and the performance issue largely goes away. + +I concur with Bruce: the reason we keep page images in WAL is to +minimize the number of places we have to fsync, and thus the amount of +head movement required for a commit. Putting the page images elsewhere +cannot be a win AFAICS. + +> Well, whether or not there's a cheap way depends on whether you consider +> fsync to be cheap. :-) + +It's never cheap :-( + + regards, tom lane + +---------------------------(end of broadcast)--------------------------- +TIP 5: Have you checked our extensive FAQ? + +http://www.postgresql.org/users-lounge/docs/faq.html + +From cjs@cynic.net Sun Jun 23 12:10:44 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907 + for ; Sun, 23 Jun 2002 12:10:43 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC) +Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST) +From: Curt Sampson +To: Tom Lane +cc: "J. R. Nield" , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <8461.1024845555@sss.pgh.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Sun, 23 Jun 2002, Tom Lane wrote: + +> Curt Sampson writes: +> > This should also allow us to disable completely the ping-pong writes +> > if we have a disk subsystem that we trust. +> +> If we have a disk subsystem we trust, we just disable fsync on the +> WAL and the performance issue largely goes away. + +No, you can't do this. If you don't fsync(), there's no guarantee +that the write ever got out of the computer's buffer cache and to +the disk subsystem in the first place. + +> I concur with Bruce: the reason we keep page images in WAL is to +> minimize the number of places we have to fsync, and thus the amount of +> head movement required for a commit. + +An fsync() does not necessarially cause head movement, or any real +disk writes at all. If you're writing to many external disk arrays, +for example, the fsync() ensures that the data are in the disk array's +non-volatile or UPS-backed RAM, no more. The array might hold the data +for quite some time before it actually writes it to disk. + +But you're right that it's faster, if you're going to write out changed +pages and have have the ping-pong file and the transaction log on the +same disk, just to write out the entire page to the transaction log. + +So what we would really need to implement, if we wanted to be more +efficient with trusted disk subsystems, would be the option of writing +to the log only the changed row or changed part of the row, or writing +the entire changed page. I don't know how hard this would be.... + +> > Well, whether or not there's a cheap way depends on whether you consider +> > fsync to be cheap. :-) +> +> It's never cheap :-( + +Actually, with a good external RAID system with non-volatile RAM, +it's a good two to four orders of magnitude cheaper than writing to a +directly connected disk that doesn't claim the write is complete until +it's physically on disk. I'd say that it qualifies as at least "not +expensive." Not that you want to do it more often than you have to +anyway.... + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 13:56:59 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335 + for ; Sun, 23 Jun 2002 13:56:58 -0400 (EDT) +Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549; + Sun, 23 Jun 2002 13:56:49 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Tom Lane +cc: Curt Sampson , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker + +In-Reply-To: <8461.1024845555@sss.pgh.pa.us> +References: + <8461.1024845555@sss.pgh.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 13:57:19 -0400 +Message-ID: <1024855044.1793.414.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 11:19, Tom Lane wrote: +> Curt Sampson writes: +> > This should also allow us to disable completely the ping-pong writes +> > if we have a disk subsystem that we trust. +> +> If we have a disk subsystem we trust, we just disable fsync on the +> WAL and the performance issue largely goes away. + +It wouldn't work because the OS buffering interferes, and we need those +WAL records on disk up to the greatest LSN of the Buffer we will be writing. + + +We already buffer WAL ourselves. We also already buffer regular pages. +Whenever we write a Buffer out of the buffer cache, it is because we +really want that page on disk and wanted to start an IO. If thats not +the case, then we should have more block buffers! + +So since we have all this buffering designed especially to meet our +needs, and since the OS buffering is in the way, can someone explain to +me why postgresql would ever open a file without the O_DSYNC flag if the +platform supports it? + + + +> +> I concur with Bruce: the reason we keep page images in WAL is to +> minimize the number of places we have to fsync, and thus the amount of +> head movement required for a commit. Putting the page images elsewhere +> cannot be a win AFAICS. + + +Why not put all the page images in a single pre-allocated file and treat +it as a ring? How could this be any worse than flushing them in the WAL +log? + +Maybe fsync would be slower with two files, but I don't see how +fdatasync would be, and most platforms support that. + +What would improve performance would be to have a dbflush process that +would work in the background flushing buffers in groups and trying to +stay ahead of ReadBuffer requests. That would let you do the temporary +side of the ping-pong as a huge O_DSYNC writev(2) request (or +fdatasync() once) and then write out the other buffers. It would also +tend to prevent the other backends from blocking on write requests. + +A dbflush could also support aio_read/aio_write on platforms like +Solaris and WindowsNT that support it. + +Am I correct that right now, buffers only get written when they get +removed from the free list for reuse? So a released dirty buffer will +sit in the buffer free list until it becomes the Least Recently Used +buffer, and will then cause a backend to block for IO in a call to +BufferAlloc? + +This would explain why we like using the OS buffer cache, and why our +performance is troublesome when we have to do synchronous IO writes, and +why fsync() takes so long to complete. All of the backends block for +each call to BufferAlloc() after a large table update by a single +backend, and then the OS buffers are always full of our "written" data. + +Am I reading the bufmgr code correctly? I already found an imaginary +race condition there once :-) + +;jnield + + +> +> > Well, whether or not there's a cheap way depends on whether you consider +> > fsync to be cheap. :-) +> +> It's never cheap :-( +> +-- +J. R. Nield +jrnield@usol.com + + +From cjs@cynic.net Sun Jun 23 14:15:15 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698 + for ; Sun, 23 Jun 2002 14:15:15 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC) +Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Tom Lane , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: ORr + +On 23 Jun 2002, J. R. Nield wrote: + +> So since we have all this buffering designed especially to meet our +> needs, and since the OS buffering is in the way, can someone explain to +> me why postgresql would ever open a file without the O_DSYNC flag if the +> platform supports it? + +It's more code, if there are platforms out there that don't support +O_DYSNC. (We still have to keep the old fsync code.) On the other hand, +O_DSYNC could save us a disk arm movement over fsync() because it +appears to me that fsync is also going to force a metadata update, which +means that the inode blocks have to be written as well. + +> Maybe fsync would be slower with two files, but I don't see how +> fdatasync would be, and most platforms support that. + +Because, if both files are on the same disk, you still have to move +the disk arm from the cylinder at the current log file write point +to the cylinder at the current ping-pong file write point. And then back +again to the log file write point cylinder. + +In the end, having a ping-pong file as well seems to me unnecessary +complexity, especially when anyone interested in really good +performance is going to buy a disk subsystem that guarantees no +torn pages and thus will want to turn off the ping-pong file writes +entirely, anyway. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 14:14:51 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649 + for ; Sun, 23 Jun 2002 14:14:50 -0400 (EDT) +Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287; + Sun, 23 Jun 2002 14:14:46 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Curt Sampson +cc: Tom Lane , Bruce Momjian , + Michael Loftis , mlw , + PostgreSQL Hacker + +In-Reply-To: +References: +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 14:15:17 -0400 +Message-ID: <1024856120.3054.418.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: OR + +On Sun, 2002-06-23 at 12:10, Curt Sampson wrote: +> +> So what we would really need to implement, if we wanted to be more +> efficient with trusted disk subsystems, would be the option of writing +> to the log only the changed row or changed part of the row, or writing +> the entire changed page. I don't know how hard this would be.... +> +We already log that stuff. The page images are in addition to the +"Logical Changes", so we could just stop logging the page images. + +-- +J. R. Nield +jrnield@usol.com + + + + From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002 Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) @@ -108,7 +864,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR J. R. Nield wrote: > So since we have all this buffering designed especially to meet our @@ -129,53 +885,298 @@ during checkpoint when we discard old WAL files. > it as a ring? How could this be any worse than flushing them in the WAL > log? > -> Maybe fsync would be slower with two files, but I don't see how -> fdatasync would be, and most platforms support that. +> Maybe fsync would be slower with two files, but I don't see how +> fdatasync would be, and most platforms support that. + +We have fdatasync option for WAL in postgresql.conf. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + + + +From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208 + for ; Mon, 24 Jun 2002 12:54:22 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 2769E4762E3 + for ; Mon, 24 Jun 2002 08:27:39 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002 +Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) + by postgresql.org (Postfix) with ESMTP id ED459475C61 + for ; Sun, 23 Jun 2002 15:37:08 -0400 (EDT) +Received: (from pgman@localhost) + by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642; + Sun, 23 Jun 2002 15:36:54 -0400 (EDT) +From: Bruce Momjian +Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us> +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: +To: Curt Sampson +Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT) +cc: "J. R. Nield" , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +X-Mailer: ELM [version 2.4ME+ PL97 (25)] +MIME-Version: 1.0 +Content-Transfer-Encoding: 7bit +Content-Type: text/plain; charset=US-ASCII +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-3.4 required=5.0 + tests=IN_REP_TO + version=2.30 +Status: OR + +Curt Sampson wrote: +> On 23 Jun 2002, J. R. Nield wrote: +> +> > So since we have all this buffering designed especially to meet our +> > needs, and since the OS buffering is in the way, can someone explain to +> > me why postgresql would ever open a file without the O_DSYNC flag if the +> > platform supports it? +> +> It's more code, if there are platforms out there that don't support +> O_DYSNC. (We still have to keep the old fsync code.) On the other hand, +> O_DSYNC could save us a disk arm movement over fsync() because it +> appears to me that fsync is also going to force a metadata update, which +> means that the inode blocks have to be written as well. + +Again, see postgresql.conf: + +#wal_sync_method = fsync # the default varies across platforms: +# # fsync, fdatasync, open_sync, or open_datasync + +> +> > Maybe fsync would be slower with two files, but I don't see how +> > fdatasync would be, and most platforms support that. +> +> Because, if both files are on the same disk, you still have to move +> the disk arm from the cylinder at the current log file write point +> to the cylinder at the current ping-pong file write point. And then back +> again to the log file write point cylinder. +> +> In the end, having a ping-pong file as well seems to me unnecessary +> complexity, especially when anyone interested in really good +> performance is going to buy a disk subsystem that guarantees no +> torn pages and thus will want to turn off the ping-pong file writes +> entirely, anyway. + +Yes, I don't see writing to two files vs. one to be any win, especially +when we need to fsync both of them. What I would really like is to +avoid the double I/O of writing to WAL and to the data file; improving +that would be a huge win. + +-- + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + + + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster + + + +From cjs@cynic.net Sun Jun 23 20:09:44 2002 +Return-path: +Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630 + for ; Sun, 23 Jun 2002 20:09:43 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC) +Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST) +From: Curt Sampson +To: Bruce Momjian +cc: "J. R. Nield" , Tom Lane , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On Sun, 23 Jun 2002, Bruce Momjian wrote: + +> Yes, I don't see writing to two files vs. one to be any win, especially +> when we need to fsync both of them. What I would really like is to +> avoid the double I/O of writing to WAL and to the data file; improving +> that would be a huge win. + +You mean, the double I/O of writing the block to the WAL and data file? +(We'd still have to write the changed columns or whatever to the WAL, +right?) + +I'd just add an option to turn it off. If you need it, you need it; +there's no way around that except to buy hardware that is really going +to guarantee your writes (which then means you don't need it). + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + +From jrnield@usol.com Sun Jun 23 21:28:58 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381 + for ; Sun, 23 Jun 2002 21:28:57 -0400 (EDT) +Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303; + Sun, 23 Jun 2002 21:28:55 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Bruce Momjian +cc: Curt Sampson , Tom Lane , + Michael Loftis + , mlw , + PostgreSQL Hacker + +In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> +References: <200206231936.g5NJasa07642@candle.pha.pa.us> +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 23 Jun 2002 21:29:23 -0400 +Message-ID: <1024882167.1793.733.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: +> Yes, I don't see writing to two files vs. one to be any win, especially +> when we need to fsync both of them. What I would really like is to +> avoid the double I/O of writing to WAL and to the data file; improving +> that would be a huge win. +> -We have fdatasync option for WAL in postgresql.conf. +If is impossible to do what you want. You can not protect against +partial writes without writing pages twice and calling fdatasync between +them while going through a generic filesystem. The best disk array will +not protect you if the operating system does not align block writes to +the structure of the underlying device. Even with raw devices, you need +special support or knowledge of the operating system and/or the disk +device to ensure that each write request will be atomic to the +underlying hardware. + +All other systems rely on the fact that you can recover a damaged file +using the log archive. This means downtime in the rare case, but no data +loss. Until PostgreSQL can do this, then it will not be acceptable for +real critical production use. This is not to knock PostgreSQL, because +it is a very good database system, and clearly the best open-source one. +It even has feature advantages over the commercial systems. But at the +end of the day, unless you have complete understanding of the I/O system +from write(2) through to the disk system, the only sure ways to protect +against partial writes are by "careful writes" (in the WAL log or +elsewhere, writing pages twice), or by requiring (and allowing) users to +do log-replay recovery when a file is corrupted by a partial write. As +long as there is a UPS, and the operating system doesn't crash, then +there still should be no partial writes. + +If we log pages to WAL, they are useless when archived (after a +checkpoint). So either we have a separate "log" for them (the ping-pong +file), or we should at least remove them when archived, which makes log +archiving more complex but is perfectly doable. + +Finally, I would love to hear why we are using the operating system +buffer manager at all. The OS is acting as a secondary buffer manager +for us. Why is that? What flaw in our I/O system does this reveal? I +know that: + +>We sync only WAL, not the other pages, except for the sync() call we do +> during checkpoint when we discard old WAL files. + +But this is probably not a good thing. We should only be writing blocks +when they need to be on disk. We should not be expecting the OS to write +them "sometime later" and avoid blocking (as long) for the write. If we +need that, then our buffer management is wrong and we need to fix it. +The reason we are doing this is because we expect the OS buffer manager +to do asynchronous I/O for us, but then we don't control the order. That +is the reason why we have to call fdatasync(), to create "sequence +points". + +The reason we have performance problems with either D_OSYNC or fdatasync +on the normal relations is because we have no dbflush process. This +causes an unacceptable amount of I/O blocking by other transactions. + +The ORACLE people were not kidding when they said that they could not +certify Linux for production use until it supported O_DSYNC. Can you +explain why that was the case? + +Finally, let me apologize if the above comes across as somewhat +belligerent. I know very well that I can't compete with you guys for +knowledge of the PosgreSQL system. I am still at a loss when I look at +the optimizer and executor modules, and it will take some time before I +can follow discussion of that area. Even then, I doubt my ability to +compare with people like Mr. Lane and Mr. Momjian in experience and +general intelligence, or in the field of database programming and +software development in particular. However, this discussion and a +search of the pgsql-hackers archives reveals this problem to be the KEY +area of PostgreSQL's failing, and general misunderstanding, when +compared to its commercial competitors. + +Sincerely, + + J. R. Nield -- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - +J. R. Nield +jrnield@usol.com ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster -From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002 -Return-path: +From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002 +Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208 - for ; Mon, 24 Jun 2002 12:54:22 -0400 (EDT) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962 + for ; Mon, 24 Jun 2002 12:38:03 -0400 (EDT) Received: from localhost.localdomain (postgresql.org [64.49.215.8]) by localhost (Postfix) with ESMTP - id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002 + id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 10:18:05 2002 Received: from postgresql.org (postgresql.org [64.49.215.8]) by postgresql.org (Postfix) with SMTP - id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT) + id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT) Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 2769E4762E3 - for ; Mon, 24 Jun 2002 08:27:39 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002 + by localhost (Postfix) with ESMTP id CDDFA475CC3 + for ; Mon, 24 Jun 2002 08:37:44 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:37:44 2002 Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id ED459475C61 - for ; Sun, 23 Jun 2002 15:37:08 -0400 (EDT) + by postgresql.org (Postfix) with ESMTP id 5C971475858 + for ; Sun, 23 Jun 2002 22:47:12 -0400 (EDT) Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642; - Sun, 23 Jun 2002 15:36:54 -0400 (EDT) + by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992; + Sun, 23 Jun 2002 22:46:44 -0400 (EDT) From: Bruce Momjian -Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us> +Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us> Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: -To: Curt Sampson -Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT) -cc: "J. R. Nield" , Tom Lane , +In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> +To: "J. R. Nield" +Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT) +cc: Curt Sampson , Tom Lane , Michael Loftis , mlw , PostgreSQL Hacker X-Mailer: ELM [version 2.4ME+ PL97 (25)] @@ -187,46 +1188,119 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR -Curt Sampson wrote: -> On 23 Jun 2002, J. R. Nield wrote: -> -> > So since we have all this buffering designed especially to meet our -> > needs, and since the OS buffering is in the way, can someone explain to -> > me why postgresql would ever open a file without the O_DSYNC flag if the -> > platform supports it? +J. R. Nield wrote: +> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: +> > Yes, I don't see writing to two files vs. one to be any win, especially +> > when we need to fsync both of them. What I would really like is to +> > avoid the double I/O of writing to WAL and to the data file; improving +> > that would be a huge win. +> > > -> It's more code, if there are platforms out there that don't support -> O_DYSNC. (We still have to keep the old fsync code.) On the other hand, -> O_DSYNC could save us a disk arm movement over fsync() because it -> appears to me that fsync is also going to force a metadata update, which -> means that the inode blocks have to be written as well. +> If is impossible to do what you want. You can not protect against +> partial writes without writing pages twice and calling fdatasync between +> them while going through a generic filesystem. The best disk array will +> not protect you if the operating system does not align block writes to +> the structure of the underlying device. Even with raw devices, you need +> special support or knowledge of the operating system and/or the disk +> device to ensure that each write request will be atomic to the +> underlying hardware. -Again, see postgresql.conf: +Yes, I suspected it was impossible, but that doesn't mean I want it any +less. ;-) -#wal_sync_method = fsync # the default varies across platforms: -# # fsync, fdatasync, open_sync, or open_datasync +> All other systems rely on the fact that you can recover a damaged file +> using the log archive. This means downtime in the rare case, but no data +> loss. Until PostgreSQL can do this, then it will not be acceptable for +> real critical production use. This is not to knock PostgreSQL, because +> it is a very good database system, and clearly the best open-source one. +> It even has feature advantages over the commercial systems. But at the +> end of the day, unless you have complete understanding of the I/O system +> from write(2) through to the disk system, the only sure ways to protect +> against partial writes are by "careful writes" (in the WAL log or +> elsewhere, writing pages twice), or by requiring (and allowing) users to +> do log-replay recovery when a file is corrupted by a partial write. As +> long as there is a UPS, and the operating system doesn't crash, then +> there still should be no partial writes. + +You are talking point-in-time recovery, a major missing feature right +next to replication, and I agree it makes PostgreSQL unacceptable for +some applications. Point taken. + +And the interesting thing you are saying is that with point-in-time +recovery, we don't need to write pre-write images of pages because if we +detect a partial page write, we then abort the database and tell the +user to do a point-in-time recovery, basically meaning we are using the +previous full backup as our pre-write page image and roll forward using +the logical logs. This is clearly a nice thing to be able to do because +it let's you take a pre-write image of the page once during full backup, +keep it offline, and bring it back in the rare case of a full page write +failure. I now can see how the MSSQL tearoff-bits would be used, not +for recovery, but to detect a partial write and force a point-in-time +recovery from the administrator. + +> If we log pages to WAL, they are useless when archived (after a +> checkpoint). So either we have a separate "log" for them (the ping-pong +> file), or we should at least remove them when archived, which makes log +> archiving more complex but is perfectly doable. + +Yes, that is how we will do point-in-time recovery; remove the +pre-write page images and archive the rest. It is more complex, but +having the fsync all in one file is too big a win. + +> Finally, I would love to hear why we are using the operating system +> buffer manager at all. The OS is acting as a secondary buffer manager +> for us. Why is that? What flaw in our I/O system does this reveal? I +> know that: > -> > Maybe fsync would be slower with two files, but I don't see how -> > fdatasync would be, and most platforms support that. -> -> Because, if both files are on the same disk, you still have to move -> the disk arm from the cylinder at the current log file write point -> to the cylinder at the current ping-pong file write point. And then back -> again to the log file write point cylinder. +> >We sync only WAL, not the other pages, except for the sync() call we do +> > during checkpoint when we discard old WAL files. > -> In the end, having a ping-pong file as well seems to me unnecessary -> complexity, especially when anyone interested in really good -> performance is going to buy a disk subsystem that guarantees no -> torn pages and thus will want to turn off the ping-pong file writes -> entirely, anyway. +> But this is probably not a good thing. We should only be writing blocks +> when they need to be on disk. We should not be expecting the OS to write +> them "sometime later" and avoid blocking (as long) for the write. If we +> need that, then our buffer management is wrong and we need to fix it. +> The reason we are doing this is because we expect the OS buffer manager +> to do asynchronous I/O for us, but then we don't control the order. That +> is the reason why we have to call fdatasync(), to create "sequence +> points". + +Yes. I think I understand. It is true we have to fsync WAL because we +can't control the individual writes by the OS. + +> The reason we have performance problems with either D_OSYNC or fdatasync +> on the normal relations is because we have no dbflush process. This +> causes an unacceptable amount of I/O blocking by other transactions. + +Uh, that would force writes all over the disk. Why do we really care how +the OS writes them? If we are going to fsync, let's just do the one +file and be done with it. What would a separate flusher process really +buy us if it has to use fsync too. The main backend doesn't have to +wait for the fsync, but then again, we can't say the transaction is +committed until it hits the disk, so how does a flusher help? -Yes, I don't see writing to two files vs. one to be any win, especially -when we need to fsync both of them. What I would really like is to -avoid the double I/O of writing to WAL and to the data file; improving -that would be a huge win. +> The ORACLE people were not kidding when they said that they could not +> certify Linux for production use until it supported O_DSYNC. Can you +> explain why that was the case? + +I don't see O_DSYNC as very different from write/fsync(or fdatasync). + +> Finally, let me apologize if the above comes across as somewhat +> belligerent. I know very well that I can't compete with you guys for +> knowledge of the PostgreSQL system. I am still at a loss when I look at +> the optimizer and executor modules, and it will take some time before I +> can follow discussion of that area. Even then, I doubt my ability to +> compare with people like Mr. Lane and Mr. Momjian in experience and +> general intelligence, or in the field of database programming and +> software development in particular. However, this discussion and a +> search of the pgsql-hackers archives reveals this problem to be the KEY +> area of PostgreSQL's failing, and general misunderstanding, when +> compared to its commercial competitors. + +We appreciate your ideas. Few of us are professional db folks so we are +always looking for good ideas. -- Bruce Momjian | http://candle.pha.pa.us @@ -237,7 +1311,9 @@ that would be a huge win. ---------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster +TIP 6: Have you searched our list archives? + +http://archives.postgresql.org @@ -260,7 +1336,7 @@ In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: RO +Status: OR On 23 Jun 2002, J. R. Nield wrote: @@ -480,7 +1556,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-5.3 required=5.0 tests=IN_REP_TO,X_NOT_PRESENT version=2.30 -Status: RO +Status: OR > On Sun, 23 Jun 2002, Bruce Momjian wrote: >> Yes, I don't see writing to two files vs. one to be any win, especially @@ -560,7 +1636,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR Tom Lane wrote: > > On Sun, 23 Jun 2002, Bruce Momjian wrote: @@ -603,6 +1679,417 @@ TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org +From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002 +Return-path: +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642 + for ; Mon, 24 Jun 2002 17:54:35 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP + id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:46:24 2002 +Received: from postgresql.org (postgresql.org [64.49.215.8]) + by postgresql.org (Postfix) with SMTP + id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT) +Received: from localhost.localdomain (postgresql.org [64.49.215.8]) + by localhost (Postfix) with ESMTP id 950004770BC + for ; Mon, 24 Jun 2002 14:59:46 -0400 (EDT) +Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 14:59:46 2002 +Received: from sss.pgh.pa.us (unknown [192.204.191.242]) + by postgresql.org (Postfix) with ESMTP id A0756475BB7 + for ; Mon, 24 Jun 2002 13:11:41 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826; + Mon, 24 Jun 2002 13:11:02 -0400 (EDT) +To: Bruce Momjian +cc: Curt Sampson , "J. R. Nield" , + Michael Loftis , mlw , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us> +References: <200206241640.g5OGeVY06116@candle.pha.pa.us> +Comments: In-reply-to Bruce Momjian + message dated "Mon, 24 Jun 2002 12:40:31 -0400" +Date: Mon, 24 Jun 2002 13:11:01 -0400 +Message-ID: <19823.1024938661@sss.pgh.pa.us> +From: Tom Lane +Precedence: bulk +Sender: pgsql-hackers-owner@postgresql.org +X-Spam-Status: No, hits=-5.3 required=5.0 + tests=IN_REP_TO,X_NOT_PRESENT + version=2.30 +Status: OR + +Bruce Momjian writes: +>> The only thing I've been able to think of that seems like it might +>> improve matters is to make the WAL writing logic aware of the layout +>> of buffer pages --- specifically, to know that our pages generally +>> contain an uninteresting "hole" in the middle, and not write the hole. +>> Optimistically this might reduce the WAL data volume by something +>> approaching 50%; though pessimistically (if most pages are near full) +>> it wouldn't help much. + +> Good idea. How about putting the page through or TOAST compression +> routine before writing it to WAL? Should be pretty easy and fast and +> doesn't require any knowledge of the page format. + +Easy, maybe, but fast definitely NOT. The compressor is not speedy. +Given that we have to be holding various locks while we build WAL +records, I do not think it's a good idea to add CPU time there. + +Also, compressing already-compressed data is not a win ... + + regards, tom lane + + + +---------------------------(end of broadcast)--------------------------- +TIP 3: if posting/reading through Usenet, please send an appropriate +subscribe-nomail command to majordomo@postgresql.org so that your +message can get through to the mailing list cleanly + + + +From jrnield@usol.com Mon Jun 24 16:49:25 2002 +Return-path: +Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393 + for ; Mon, 24 Jun 2002 16:49:24 -0400 (EDT) +Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113]) + by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100; + Mon, 24 Jun 2002 16:49:18 -0400 +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +From: "J. R. Nield" +To: Curt Sampson +cc: Bruce Momjian , Tom Lane , + PostgreSQL Hacker +In-Reply-To: +References: +Content-Type: text/plain +Content-Transfer-Encoding: 7bit +X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) +Date: 24 Jun 2002 16:49:42 -0400 +Message-ID: <1024951786.1793.865.camel@localhost.localdomain> +MIME-Version: 1.0 +Status: ORr + +On Sun, 2002-06-23 at 23:40, Curt Sampson wrote: +> On 23 Jun 2002, J. R. Nield wrote: +> +> > If is impossible to do what you want. You can not protect against +> > partial writes without writing pages twice and calling fdatasync +> > between them while going through a generic filesystem. +> +> I agree with this. +> +> > The best disk array will not protect you if the operating system does +> > not align block writes to the structure of the underlying device. +> +> This I don't quite understand. Assuming you're using a SCSI drive +> (and this mostly applies to ATAPI/IDE, too), you can do naught but +> align block writes to the structure of the underlying device. When you +> initiate a SCSI WRITE command, you start by telling the device at which +> block to start writing and how many blocks you intend to write. Then you +> start passing the data. +> + +All I'm saying is that the entire postgresql block write must be +converted into exactly one SCSI write command in all cases, and I don't +know a portable way to ensure this. + +> > Even with raw devices, you need special support or knowledge of the +> > operating system and/or the disk device to ensure that each write +> > request will be atomic to the underlying hardware. +> +> Well, so here I guess you're talking about two things: +> +> 1. When you request, say, an 8K block write, will the OS really +> write it to disk in a single 8K or multiple of 8K SCSI write +> command? +> +> 2. Does the SCSI device you're writing to consider these writes to +> be transactional. That is, if the write is interrupted before being +> completed, does the SCSI device guarantee that the partially-sent +> data is not written, and the old data is maintained? And of course, +> does it guarantee that, when it acknowledges a write, that write is +> now in stable storage and will never go away? +> +> Both of these are not hard to guarantee, actually. For a BSD-based OS, +> for example, just make sure that your filesystem block size is the +> same as or a multiple of the database block size. BSD will never write +> anything other than a block or a sequence of blocks to a disk in a +> single SCSI transaction (unless you've got a really odd SCSI driver). +> And for your disk, buy a Baydel or Clarion disk array, or something +> similar. +> +> Given that it's not hard to set up a system that meets these criteria, +> and this is in fact commonly done for database servers, it would seem a +> good idea for postgres to have the option to take advantage of the time +> and money spent and adjust its performance upward appropriately. + +I agree with this. My point was only that you need to know what +guarantees your operating system/hardware combination provides on a +case-by-case basis, and there is no standard way for a program to +discover this. Most system administrators are not going to know this +either, unless databases are their main responsibility. + +> +> > All other systems rely on the fact that you can recover a damaged file +> > using the log archive. +> +> Not exactly. For MS SQL Server, at any rate, if it detects a page tear +> you cannot restore based on the log file alone. You need a full or +> partial backup that includes that entire torn block. +> + +I should have been more specific: you need a backup of the file from +some time ago, plus all the archived logs from then until the current +log sequence number. + +> > This means downtime in the rare case, but no data loss. Until +> > PostgreSQL can do this, then it will not be acceptable for real +> > critical production use. +> +> It seems to me that it is doing this right now. In fact, it's more +> reliable than some commerial systems (such as SQL Server) because it can +> recover from a torn block with just the logfile. + +Again, what I meant to say is that the commercial systems can recover +with an old file backup + logs. How old the backup can be depends only +on how much time you are willing to spend playing the logs forward. So +if you do a full backup once a week, and multiplex and backup the logs, +then even if a backup tape gets destroyed you can still survive. It just +takes longer. + +Also, postgreSQL can't recover from any other type of block corruption, +while the commercial systems can. That's what I meant by the "critical +production use" comment, which was sort-of unfair. + +So I would say they are equally reliable for torn pages (but not bad +blocks), and the commercial systems let you trade potential recovery +time for not having to write the blocks twice. You do need to back-up +the log archives though. + +> +> > But at the end of the day, unless you have complete understanding of +> > the I/O system from write(2) through to the disk system, the only sure +> > ways to protect against partial writes are by "careful writes" (in +> > the WAL log or elsewhere, writing pages twice), or by requiring (and +> > allowing) users to do log-replay recovery when a file is corrupted by +> > a partial write. +> +> I don't understand how, without a copy of the old data that was in the +> torn block, you can restore that block from just log file entries. Can +> you explain this to me? Take, as an example, a block with ten tuples, +> only one of which has been changed "recently." (I.e., only that change +> is in the log files.) +> +> +> > If we log pages to WAL, they are useless when archived (after a +> > checkpoint). So either we have a separate "log" for them (the +> > ping-pong file), or we should at least remove them when archived, +> > which makes log archiving more complex but is perfectly doable. +> +> Right. That seems to me a better option, since we've now got only one +> write point on the disk rather than two. + +OK. I agree with this now. + +> +> > Finally, I would love to hear why we are using the operating system +> > buffer manager at all. The OS is acting as a secondary buffer manager +> > for us. Why is that? What flaw in our I/O system does this reveal? +> +> It's acting as a "second-level" buffer manager, yes, but to say it's +> "secondary" may be a bit misleading. On most of the systems I've set +> up, the OS buffer cache is doing the vast majority of the work, and the +> postgres buffering is fairly minimal. +> +> There are some good (and some perhaps not-so-good) reasons to do it this +> way. I'll list them more or less in the order of best to worst: +> +> 1. The OS knows where the blocks physically reside on disk, and +> postgres does not. Therefore it's in the interest of postgresql to +> dispatch write responsibility back to the OS as quickly as possible +> so that the OS can prioritize requests appropriately. Most operating +> systems use an "elevator" algorithm to minimize disk head movement; +> but if the OS does not have a block that it could write while the +> head is "on the way" to another request, it can't write it in that +> head pass. +> +> 2. Postgres does not know about any "bank-switching" tricks for +> mapping more physical memory than it has address space. Thus, on +> 32-bit machines, postgres might be limited to mapping 2 or 3 GB of +> memory, even though the machine has, say, 6 GB of physical RAM. The +> OS can use all of the available memory for caching; postgres cannot. +> +> 3. A lot of work has been put into the seek algorithms, read-ahead +> algorithms, block allocation algorithms, etc. in the OS. Why +> duplicate all that work again in postgres? +> +> When you say things like the following: +> +> > We should only be writing blocks when they need to be on disk. We +> > should not be expecting the OS to write them "sometime later" and +> > avoid blocking (as long) for the write. If we need that, then our +> > buffer management is wrong and we need to fix it. +> +> you appear to be making the arugment that we should take the route of +> other database systems, and use raw devices and our own management of +> disk block allocation. If so, you might want first to look back through +> the archives at the discussion I and several others had about this a +> month or two ago. After looking in detail at what NetBSD, at least, does +> in terms of its disk I/O algorithms and buffering, I've pretty much come +> around, at least for the moment, to the attitude that we should stick +> with using the OS. I wouldn't mind seeing postgres be able to manage all +> of this stuff, but it's a *lot* of work for not all that much benefit +> that I can see. + +I'll back off on that. I don't know if we want to use the OS buffer +manager, but shouldn't we try to have our buffer manager group writes +together by files, and pro-actively get them out to disk? Right now, it +looks like all our write requests are delayed as long as possible and +the order in which they are written is pretty-much random, as is the +backend that writes the block, so there is no locality of reference even +when the blocks are adjacent on disk, and the write calls are spread-out +over all the backends. + +Would it not be the case that things like read-ahead, grouping writes, +and caching written data are probably best done by PostgreSQL, because +only our buffer manager can understand when they will be useful or when +they will thrash the cache? + +I may likely be wrong on this, and I haven't done any performance +testing. I shouldn't have brought this up alongside the logging issues, +but there seemed to be some question about whether the OS was actually +doing all these things behind the scene. + + +> +> > The ORACLE people were not kidding when they said that they could not +> > certify Linux for production use until it supported O_DSYNC. Can you +> > explain why that was the case? +> +> I'm suspecting it's because Linux at the time had no raw devices, so +> O_DSYNC was the only other possible method of making sure that disk +> writes actually got to disk. +> +> You certainly don't want to use O_DSYNC if you can use another method, +> because O_DSYNC still goes through the the operating system's buffer +> cache, wasting memory and double-caching things. If you're doing your +> own management, you need either to use a raw device or open files with +> the flag that indicates that the buffer cache should not be used at all +> for reads from and writes to that file. + +Would O_DSYNC|O_RSYNC turn off the cache? + +> +> > However, this discussion and a search of the pgsql-hackers archives +> > reveals this problem to be the KEY area of PostgreSQL's failing, and +> > general misunderstanding, when compared to its commercial competitors. +> +> No, I think it's just that you're under a few minor misapprehensions +> here about what postgres and the OS are actually doing. As I said, I +> went through this whole exact argument a month or two ago, on this very +> list, and I came around to the idea that what postgres is doing now +> works quite well, at least on NetBSD. (Most other OSes have disk I/O +> algorithms that are pretty much as good or better.) There might be a +> very slight advantage to doing all one's own I/O management, but it's +> a huge amount of work, and I think that much effort could be much more +> usefully applied to other areas. + +I will look for that discussion in the archives. + +The logging issue is a key one I think. At least I would be very nervous +as a DBA if I were running a system where any damaged file would cause +data loss. + +Does anyone know what the major barriers to infinite log replay are in +PostgreSQL? I'm trying to look for everything that might need to be +changed outside xlog.c, but surely this has come up before. Searching +the archives hasn't revealed much. + + + +As to the I/O issue: + +Since you know a lot about NetBSD internals, I'd be interested in +hearing about what postgresql looks like to the NetBSD buffer manager. +Am I right that strings of successive writes get randomized? What do our +cache-hit percentages look like? I'm going to do some experimenting with +this. + +> +> Just as a side note, I've been a NetBSD developer since about '96, +> and have been delving into the details of OS design since well before +> that time, so I'm coming to this with what I hope is reasonably good +> knowledge of how disks work and how operating systems use them. (Not +> that this should stop you from pointing out holes in my arguments. :-)) +> + +This stuff is very difficult to get right. Glad to know you follow this +list. + + +> cjs +> -- +> Curt Sampson +81 90 7737 2974 http://www.netbsd.org +> Don't you know, in this new Dark Age, we're all light. --XTC +> +-- +J. R. Nield +jrnield@usol.com + + + + +From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002 +Return-path: +Received: from sss.pgh.pa.us (root@[192.204.191.242]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284 + for ; Mon, 24 Jun 2002 17:16:05 -0400 (EDT) +Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) + by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379; + Mon, 24 Jun 2002 17:16:02 -0400 (EDT) +To: "J. R. Nield" +cc: Curt Sampson , Bruce Momjian , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> +References: <1024951786.1793.865.camel@localhost.localdomain> +Comments: In-reply-to "J. R. Nield" + message dated "24 Jun 2002 16:49:42 -0400" +Date: Mon, 24 Jun 2002 17:16:01 -0400 +Message-ID: <21376.1024953361@sss.pgh.pa.us> +From: Tom Lane +Status: OR + +"J. R. Nield" writes: +> Also, postgreSQL can't recover from any other type of block corruption, +> while the commercial systems can. + +Say again? + +> Would it not be the case that things like read-ahead, grouping writes, +> and caching written data are probably best done by PostgreSQL, because +> only our buffer manager can understand when they will be useful or when +> they will thrash the cache? + +I think you have been missing the point. No one denies that there will +be some incremental gain if we do all that. However, the conclusion of +everyone who has thought much about it (and I see Curt has joined that +group) is that the effort would be far out of proportion to the probable +gain. There are a lot of other things we desperately need to spend time +on that would not amount to re-engineering large quantities of OS-level +code. Given that most Unixen have perfectly respectable disk management +subsystems, we prefer to tune our code to make use of that stuff, rather +than follow the "conventional wisdom" that databases need to bypass it. + +Oracle can afford to do that sort of thing because they have umpteen +thousand developers available. Postgres does not. + + regards, tom lane + From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002 Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) @@ -642,7 +2129,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR J. R. Nield wrote: > > This I don't quite understand. Assuming you're using a SCSI drive @@ -764,7 +2251,7 @@ Comments: In-reply-to Bruce Momjian Date: Mon, 24 Jun 2002 17:31:56 -0400 Message-ID: <21482.1024954316@sss.pgh.pa.us> From: Tom Lane -Status: ROr +Status: ORr Bruce Momjian writes: >> Does anyone know what the major barriers to infinite log replay are in @@ -828,7 +2315,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-5.3 required=5.0 tests=IN_REP_TO,X_NOT_PRESENT version=2.30 -Status: RO +Status: OR Bruce Momjian writes: >> Does anyone know what the major barriers to infinite log replay are in @@ -900,7 +2387,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR Tom Lane wrote: > Bruce Momjian writes: @@ -988,7 +2475,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR On Mon, 2002-06-24 at 17:16, Tom Lane wrote: @@ -1035,7 +2522,7 @@ X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) Date: 24 Jun 2002 20:28:00 -0400 Message-ID: <1024964884.3031.876.camel@localhost.localdomain> MIME-Version: 1.0 -Status: RO +Status: OR On Mon, 2002-06-24 at 17:16, Tom Lane wrote: @@ -1050,6 +2537,50 @@ jrnield@usol.com +From cjs@cynic.net Mon Jun 24 23:32:23 2002 +Return-path: +Received: from academic.cynic.net ([63.144.177.3]) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287 + for ; Mon, 24 Jun 2002 23:32:23 -0400 (EDT) +Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) + by academic.cynic.net (Postfix) with ESMTP + id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC) +Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST) +From: Curt Sampson +To: "J. R. Nield" +cc: Bruce Momjian , Tom Lane , + PostgreSQL Hacker +Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE +In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> +Message-ID: +MIME-Version: 1.0 +Content-Type: TEXT/PLAIN; charset=US-ASCII +Status: OR + +On 24 Jun 2002, J. R. Nield wrote: + +> All I'm saying is that the entire postgresql block write must be +> converted into exactly one SCSI write command in all cases, and I don't +> know a portable way to ensure this. + +No, there's no portable way. All you can do is give the admin who +is able to set things up safely the ability to turn of the now-unneeded +(and expensive) safety-related stuff that postgres does. + +> I agree with this. My point was only that you need to know what +> guarantees your operating system/hardware combination provides on a +> case-by-case basis, and there is no standard way for a program to +> discover this. Most system administrators are not going to know this +> either, unless databases are their main responsibility. + +Certainly this is true of pretty much every database system out there. + +cjs +-- +Curt Sampson +81 90 7737 2974 http://www.netbsd.org + Don't you know, in this new Dark Age, we're all light. --XTC + + From cjs@cynic.net Tue Jun 25 01:09:02 2002 Return-path: Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) @@ -1068,7 +2599,7 @@ In-Reply-To: <21376.1024953361@sss.pgh.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: ROr +Status: ORr On Mon, 24 Jun 2002, Tom Lane wrote: @@ -1131,7 +2662,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: RO +Status: OR Curt Sampson wrote: > On Mon, 24 Jun 2002, Tom Lane wrote: @@ -1204,7 +2735,7 @@ X-Virus-Scanned: by AMaViS new-20020517 Precedence: bulk Sender: pgsql-hackers-owner@postgresql.org X-Virus-Scanned: by AMaViS new-20020517 -Status: ROr +Status: ORr I have been experimenting with empirical tests of file system and device level writes to determine the actual constraints in order to speed up the WAL @@ -1422,7 +2953,7 @@ X-Mailing-List: pgsql-hackers Precedence: bulk Sender: pgsql-hackers-owner@postgresql.org X-Virus-Scanned: by amavisd-new at hub.org -Status: RO +Status: OR > One of the things I was thinking about was whether we could use up those