From: Bruce Momjian Date: Fri, 3 Jun 2005 17:14:17 +0000 (+0000) Subject: Trim TODO.detail for wal completed items. X-Git-Tag: REL8_1_0BETA1~668 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=97e455c9b93426cfc6259910612326e03acb5a75;p=postgresql Trim TODO.detail for wal completed items. --- diff --git a/doc/TODO.detail/wal b/doc/TODO.detail/wal index 244f4f7519..37d2edd432 100644 --- a/doc/TODO.detail/wal +++ b/doc/TODO.detail/wal @@ -1,84 +1,3 @@ -From cjs@cynic.net Thu Jun 20 22:18:27 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5L2IPo22195 - for ; Thu, 20 Jun 2002 22:18:26 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 88216F821; Fri, 21 Jun 2002 02:18:17 +0000 (UTC) -Date: Fri, 21 Jun 2002 11:18:14 +0900 (JST) -From: Curt Sampson -To: Bruce Momjian -cc: Michael Loftis , mlw , - PostgreSQL-development -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Thu, 20 Jun 2002, Bruce Momjian wrote: - -> > MS SQL Server has an interesting way of dealing with this. They have a -> > "torn" bit in each 512-byte chunk of a page, and this bit is set the -> > same for each chunk. When they are about to write out a page, they first -> > flip all of the torn bits and then do the write. If the write does not -> > complete due to a system crash or whatever, this can be detected later -> > because the torn bits won't match across the entire page. -> -> I was wondering, how does knowing the block is corrupt help MS SQL? - -I'm trying to recall, but I can't off hand. I'll have to look it -up in my Inside SQL Server book, which is at home right now, -unfortunately. I'll bring the book into work and let you know the -details later. - -> Right now, we write changed pages to WAL, then later write them to disk. - -Ah. You write the entire page? MS writes only the changed tuple. -And DB2, in fact, goes one better and writes only the part of the -tuple up to the change, IIRC. Thus, if you put smaller and/or more -frequently changed columns first, you'll have smaller logs. - -> I have always been looking for a way to prevent these WAL writes. The -> 512-byte bit seems interesting, but how does it help? - -Well, this would at least let you reduce the write to the 512-byte -chunk that changed, rather than writing the entire 8K page. - -> And how does the bit help them with partial block writes? Is the bit at -> the end of the block? Is that reliable? - -The bit is somewhere within every 512 byte "disk page" within the -8192 byte "filesystem/database page." So an 8KB page is divided up -like this: - - | <----------------------- 8 Kb ----------------------> | - - | 512b | 512b | 512b | 512b | 512b | 512b | 512b | 512b | - -Thus, the tear bits start out like this: - - | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | - -After a successful write of the entire page, you have this: - - | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - -If the write is unsuccessful, you end up with something like this: - - | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | - -And now you know which parts of your page got written, and which -parts didn't. - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - From cjs@cynic.net Sat Jun 22 04:41:54 2002 Return-path: Received: from academic.cynic.net ([63.144.177.3]) @@ -98,7 +17,7 @@ In-Reply-To: <19332.1024668861@sss.pgh.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR +Status: RO On Fri, 21 Jun 2002, Tom Lane wrote: @@ -149,681 +68,6 @@ Curt Sampson +81 90 7737 2974 http://www.netbsd.org Don't you know, in this new Dark Age, we're all light. --XTC -From pgsql-hackers-owner+M24060@postgresql.org Sat Jun 22 18:31:21 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5MMVKF20014 - for ; Sat, 22 Jun 2002 18:31:20 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 0ADFE476090; Sat, 22 Jun 2002 18:31:10 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 6B372475A96; Sat, 22 Jun 2002 18:28:42 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 47AD2475935 - for ; Sat, 22 Jun 2002 18:28:40 -0400 (EDT) -Received: from hades.usol.com (hades.usol.com [208.232.58.41]) - by postgresql.org (Postfix) with ESMTP id 1D5DA476166 - for ; Sat, 22 Jun 2002 18:23:16 -0400 (EDT) -Received: from 01-081.024.popsite.net (01-081.024.popsite.net [216.126.160.81]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5MMMOj11344; - Sat, 22 Jun 2002 18:22:25 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Bruce Momjian -cc: Curt Sampson , Michael Loftis , - mlw , - PostgreSQL Hacker , - Tom Lane -In-Reply-To: <200206210158.g5L1wFk20118@candle.pha.pa.us> -References: <200206210158.g5L1wFk20118@candle.pha.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -Message-ID: <1024784514.1793.242.camel@localhost.localdomain> -MIME-Version: 1.0 -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 22 Jun 2002 18:22:58 -0400 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -On Thu, 2002-06-20 at 21:58, Bruce Momjian wrote: -> I was wondering, how does knowing the block is corrupt help MS SQL? -> Right now, we write changed pages to WAL, then later write them to disk. -> I have always been looking for a way to prevent these WAL writes. The -> 512-byte bit seems interesting, but how does it help? -> -> And how does the bit help them with partial block writes? Is the bit at -> the end of the block? Is that reliable? -> - -My understanding of this is as follows: - -1) On most commercial systems, if you get a corrupted block (from -partial write or whatever) you need to restore the file(s) from the most -recent backup, and replay the log from the log archive (usually only the -damaged files will be written to during replay). - -2) If you can't deal with the downtime to recover the file, then EMC, -Sun, or IBM will sell you an expensive disk array with an NVRAM cache -that will do atomic writes. Some plain-vanilla SCSI disks are also -capable of atomic writes, though usually they don't use NVRAM to do it. - -The database must then make sure that each page-write gets translated -into exactly one SCSI-level write. This is one reason why ORACLE and -Sybase recommend that you use raw disk partitions for high availability. -Some operating systems support this through the filesystem, but it is OS -dependent. I think Solaris 7 & 8 has support for this, but I'm not sure. - -PostgreSQL has trouble because it can neither archive logs for replay, -nor use raw disk partitions. - - -One other point: - -Page pre-image logging is fundamentally the same as what Jim Grey's -book[1] would call "careful writes". I don't believe they should be in -the XLOG, because we never need to keep the pre-images after we're sure -the buffer has made it to the disk. Instead, we should have the buffer -IO routines implement ping-pong writes of some kind if we want -protection from partial writes. - - -Does any of this make sense? - - - -;jrnield - - -[1] Grey, J. and Reuter, A. (1993). "Transaction Processing: Concepts - and Techniques". Morgan Kaufmann. - --- -J. R. Nield -jrnield@usol.com - - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-hackers-owner+M24068@postgresql.org Sun Jun 23 08:40:27 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCeQF01601 - for ; Sun, 23 Jun 2002 08:40:27 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 8AC4B475CBC; Sun, 23 Jun 2002 08:40:22 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 4683647599D; Sun, 23 Jun 2002 08:37:40 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 0D57847592A - for ; Sun, 23 Jun 2002 08:37:38 -0400 (EDT) -Received: from hades.usol.com (hades.usol.com [208.232.58.41]) - by postgresql.org (Postfix) with ESMTP id 75326475876 - for ; Sun, 23 Jun 2002 08:37:36 -0400 (EDT) -Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; - Sun, 23 Jun 2002 08:37:23 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Bruce Momjian -cc: Curt Sampson , Michael Loftis , - mlw , - PostgreSQL Hacker , - Tom Lane -In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> -References: <200206222317.g5MNHBn23427@candle.pha.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 23 Jun 2002 08:37:53 -0400 -Message-ID: <1024835880.1793.264.camel@localhost.localdomain> -MIME-Version: 1.0 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: -> J. R. Nield wrote: -> > One other point: -> > -> > Page pre-image logging is fundamentally the same as what Jim Grey's -> > book[1] would call "careful writes". I don't believe they should be in -> > the XLOG, because we never need to keep the pre-images after we're sure -> > the buffer has made it to the disk. Instead, we should have the buffer -> > IO routines implement ping-pong writes of some kind if we want -> > protection from partial writes. -> -> Ping-pong writes to where? We have to fsync, and rather than fsync that -> area and WAL, we just do WAL. Not sure about a win there. -> - -The key question is: do we have some method to ensure that the OS -doesn't do the writes in parallel? - -If the OS will ensure that one of the two block writes of a ping-pong -completes before the other starts, then we don't need to fsync() at -all. - -The only thing we are protecting against is the possibility of both -writes being partial. If neither is done, that's fine because WAL will -protect us. If the first write is partial, we will detect that and use -the old data from the other, then recover from WAL. If the first is -complete but the second is partial, then we detect that and use the -newer block from the first write. If the second is complete but the -first is partial, we detect that and use the newer block from the second -write. - -So does anyone know a way to prevent parallel writes in one of the -common unix standards? Do they say anything about this? - -It would seem to me that if the same process does both ping-pong writes, -then there should be a cheap way to enforce a serial order. I could be -wrong though. - -As to where the first block of the ping-pong should go, maybe we could -reserve a file with nBlocks space for them, and write the information -about which block was being written to the XLOG for use in recovery. -There are many other ways to do it. - -;jrnield - --- -J. R. Nield -jrnield@usol.com - - - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - -From jrnield@usol.com Sun Jun 23 08:37:30 2002 -Return-path: -Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NCbRF28741 - for ; Sun, 23 Jun 2002 08:37:28 -0400 (EDT) -Received: from 08-032.024.popsite.net (08-032.024.popsite.net [66.19.4.32]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NCbNj02111; - Sun, 23 Jun 2002 08:37:23 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Bruce Momjian -cc: Curt Sampson , Michael Loftis , - mlw - , - PostgreSQL Hacker , - Tom Lane -In-Reply-To: <200206222317.g5MNHBn23427@candle.pha.pa.us> -References: <200206222317.g5MNHBn23427@candle.pha.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 23 Jun 2002 08:37:53 -0400 -Message-ID: <1024835880.1793.264.camel@localhost.localdomain> -MIME-Version: 1.0 -Status: OR - -On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: -> J. R. Nield wrote: -> > One other point: -> > -> > Page pre-image logging is fundamentally the same as what Jim Grey's -> > book[1] would call "careful writes". I don't believe they should be in -> > the XLOG, because we never need to keep the pre-images after we're sure -> > the buffer has made it to the disk. Instead, we should have the buffer -> > IO routines implement ping-pong writes of some kind if we want -> > protection from partial writes. -> -> Ping-pong writes to where? We have to fsync, and rather than fsync that -> area and WAL, we just do WAL. Not sure about a win there. -> - -The key question is: do we have some method to ensure that the OS -doesn't do the writes in parallel? - -If the OS will ensure that one of the two block writes of a ping-pong -completes before the other starts, then we don't need to fsync() at -all. - -The only thing we are protecting against is the possibility of both -writes being partial. If neither is done, that's fine because WAL will -protect us. If the first write is partial, we will detect that and use -the old data from the other, then recover from WAL. If the first is -complete but the second is partial, then we detect that and use the -newer block from the first write. If the second is complete but the -first is partial, we detect that and use the newer block from the second -write. - -So does anyone know a way to prevent parallel writes in one of the -common unix standards? Do they say anything about this? - -It would seem to me that if the same process does both ping-pong writes, -then there should be a cheap way to enforce a serial order. I could be -wrong though. - -As to where the first block of the ping-pong should go, maybe we could -reserve a file with nBlocks space for them, and write the information -about which block was being written to the XLOG for use in recovery. -There are many other ways to do it. - -;jrnield - --- -J. R. Nield -jrnield@usol.com - - - - -From cjs@cynic.net Sun Jun 23 09:33:29 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NDXSF11543 - for ; Sun, 23 Jun 2002 09:33:28 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id A83ABF820; Sun, 23 Jun 2002 13:33:15 +0000 (UTC) -Date: Sun, 23 Jun 2002 22:33:07 +0900 (JST) -From: Curt Sampson -To: "J. R. Nield" -cc: Bruce Momjian , Michael Loftis , - mlw , - PostgreSQL Hacker , - Tom Lane -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <1024835880.1793.264.camel@localhost.localdomain> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On 23 Jun 2002, J. R. Nield wrote: - -> On Sat, 2002-06-22 at 19:17, Bruce Momjian wrote: -> > J. R. Nield wrote: -> > > One other point: -> > > -> > > Page pre-image logging is fundamentally the same as what Jim Grey's -> > > book[1] would call "careful writes". I don't believe they should be in -> > > the XLOG, because we never need to keep the pre-images after we're sure -> > > the buffer has made it to the disk. Instead, we should have the buffer -> > > IO routines implement ping-pong writes of some kind if we want -> > > protection from partial writes. -> > -> > Ping-pong writes to where? We have to fsync, and rather than fsync that -> > area and WAL, we just do WAL. Not sure about a win there. - -Presumably the win is that, "we never need to keep the pre-images -after we're sure the buffer has made it to the disk." So the -pre-image log can be completely ditched when we shut down the -server, so a full system sync, or whatever. This keeps the log file -size down, which means faster recovery, less to back up (when we -start getting transaction logs that can be backed up), etc. - -This should also allow us to disable completely the ping-pong writes -if we have a disk subsystem that we trust. (E.g., a disk array with -battery backed memory.) That would, in theory, produce a nice little -performance increase when lots of inserts and/or updates are being -committed, as we have much, much less to write to the log file. - -Are there stats that track, e.g., the bandwidth of writes to the -log file? I'd be interested in knowing just what kind of savings -one might see by doing this. - -> The key question is: do we have some method to ensure that the OS -> doesn't do the writes in parallel?... -> It would seem to me that if the same process does both ping-pong writes, -> then there should be a cheap way to enforce a serial order. I could be -> wrong though. - -Well, whether or not there's a cheap way depends on whether you consider -fsync to be cheap. :-) - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From pgsql-hackers-owner+M24073@postgresql.org Sun Jun 23 11:19:59 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NFJxF19785 - for ; Sun, 23 Jun 2002 11:19:59 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 0BD5B475E79; Sun, 23 Jun 2002 11:19:55 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 5C0CB475D6A; Sun, 23 Jun 2002 11:19:50 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id E2353475C4B - for ; Sun, 23 Jun 2002 11:19:47 -0400 (EDT) -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id 746F8475AEA - for ; Sun, 23 Jun 2002 11:19:46 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5NFJF108464; - Sun, 23 Jun 2002 11:19:15 -0400 (EDT) -To: Curt Sampson -cc: "J. R. Nield" , Bruce Momjian , - Michael Loftis , mlw , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: -References: -Comments: In-reply-to Curt Sampson - message dated "Sun, 23 Jun 2002 22:33:07 +0900" -Date: Sun, 23 Jun 2002 11:19:15 -0400 -Message-ID: <8461.1024845555@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -Curt Sampson writes: -> This should also allow us to disable completely the ping-pong writes -> if we have a disk subsystem that we trust. - -If we have a disk subsystem we trust, we just disable fsync on the -WAL and the performance issue largely goes away. - -I concur with Bruce: the reason we keep page images in WAL is to -minimize the number of places we have to fsync, and thus the amount of -head movement required for a commit. Putting the page images elsewhere -cannot be a win AFAICS. - -> Well, whether or not there's a cheap way depends on whether you consider -> fsync to be cheap. :-) - -It's never cheap :-( - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - -From cjs@cynic.net Sun Jun 23 12:10:44 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NGAgF22907 - for ; Sun, 23 Jun 2002 12:10:43 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 57BFDF820; Sun, 23 Jun 2002 16:10:35 +0000 (UTC) -Date: Mon, 24 Jun 2002 01:10:26 +0900 (JST) -From: Curt Sampson -To: Tom Lane -cc: "J. R. Nield" , Bruce Momjian , - Michael Loftis , mlw , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <8461.1024845555@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Sun, 23 Jun 2002, Tom Lane wrote: - -> Curt Sampson writes: -> > This should also allow us to disable completely the ping-pong writes -> > if we have a disk subsystem that we trust. -> -> If we have a disk subsystem we trust, we just disable fsync on the -> WAL and the performance issue largely goes away. - -No, you can't do this. If you don't fsync(), there's no guarantee -that the write ever got out of the computer's buffer cache and to -the disk subsystem in the first place. - -> I concur with Bruce: the reason we keep page images in WAL is to -> minimize the number of places we have to fsync, and thus the amount of -> head movement required for a commit. - -An fsync() does not necessarially cause head movement, or any real -disk writes at all. If you're writing to many external disk arrays, -for example, the fsync() ensures that the data are in the disk array's -non-volatile or UPS-backed RAM, no more. The array might hold the data -for quite some time before it actually writes it to disk. - -But you're right that it's faster, if you're going to write out changed -pages and have have the ping-pong file and the transaction log on the -same disk, just to write out the entire page to the transaction log. - -So what we would really need to implement, if we wanted to be more -efficient with trusted disk subsystems, would be the option of writing -to the log only the changed row or changed part of the row, or writing -the entire changed page. I don't know how hard this would be.... - -> > Well, whether or not there's a cheap way depends on whether you consider -> > fsync to be cheap. :-) -> -> It's never cheap :-( - -Actually, with a good external RAID system with non-volatile RAM, -it's a good two to four orders of magnitude cheaper than writing to a -directly connected disk that doesn't claim the write is complete until -it's physically on disk. I'd say that it qualifies as at least "not -expensive." Not that you want to do it more often than you have to -anyway.... - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From jrnield@usol.com Sun Jun 23 13:56:59 2002 -Return-path: -Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NHusF00335 - for ; Sun, 23 Jun 2002 13:56:58 -0400 (EDT) -Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NHunj18549; - Sun, 23 Jun 2002 13:56:49 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Tom Lane -cc: Curt Sampson , Bruce Momjian , - Michael Loftis , mlw , - PostgreSQL Hacker - -In-Reply-To: <8461.1024845555@sss.pgh.pa.us> -References: - <8461.1024845555@sss.pgh.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 23 Jun 2002 13:57:19 -0400 -Message-ID: <1024855044.1793.414.camel@localhost.localdomain> -MIME-Version: 1.0 -Status: ORr - -On Sun, 2002-06-23 at 11:19, Tom Lane wrote: -> Curt Sampson writes: -> > This should also allow us to disable completely the ping-pong writes -> > if we have a disk subsystem that we trust. -> -> If we have a disk subsystem we trust, we just disable fsync on the -> WAL and the performance issue largely goes away. - -It wouldn't work because the OS buffering interferes, and we need those -WAL records on disk up to the greatest LSN of the Buffer we will be writing. - - -We already buffer WAL ourselves. We also already buffer regular pages. -Whenever we write a Buffer out of the buffer cache, it is because we -really want that page on disk and wanted to start an IO. If thats not -the case, then we should have more block buffers! - -So since we have all this buffering designed especially to meet our -needs, and since the OS buffering is in the way, can someone explain to -me why postgresql would ever open a file without the O_DSYNC flag if the -platform supports it? - - - -> -> I concur with Bruce: the reason we keep page images in WAL is to -> minimize the number of places we have to fsync, and thus the amount of -> head movement required for a commit. Putting the page images elsewhere -> cannot be a win AFAICS. - - -Why not put all the page images in a single pre-allocated file and treat -it as a ring? How could this be any worse than flushing them in the WAL -log? - -Maybe fsync would be slower with two files, but I don't see how -fdatasync would be, and most platforms support that. - -What would improve performance would be to have a dbflush process that -would work in the background flushing buffers in groups and trying to -stay ahead of ReadBuffer requests. That would let you do the temporary -side of the ping-pong as a huge O_DSYNC writev(2) request (or -fdatasync() once) and then write out the other buffers. It would also -tend to prevent the other backends from blocking on write requests. - -A dbflush could also support aio_read/aio_write on platforms like -Solaris and WindowsNT that support it. - -Am I correct that right now, buffers only get written when they get -removed from the free list for reuse? So a released dirty buffer will -sit in the buffer free list until it becomes the Least Recently Used -buffer, and will then cause a backend to block for IO in a call to -BufferAlloc? - -This would explain why we like using the OS buffer cache, and why our -performance is troublesome when we have to do synchronous IO writes, and -why fsync() takes so long to complete. All of the backends block for -each call to BufferAlloc() after a large table update by a single -backend, and then the OS buffers are always full of our "written" data. - -Am I reading the bufmgr code correctly? I already found an imaginary -race condition there once :-) - -;jnield - - -> -> > Well, whether or not there's a cheap way depends on whether you consider -> > fsync to be cheap. :-) -> -> It's never cheap :-( -> --- -J. R. Nield -jrnield@usol.com - - -From cjs@cynic.net Sun Jun 23 14:15:15 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIFEF01698 - for ; Sun, 23 Jun 2002 14:15:15 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 796E6F820; Sun, 23 Jun 2002 18:15:08 +0000 (UTC) -Date: Mon, 24 Jun 2002 03:15:01 +0900 (JST) -From: Curt Sampson -To: "J. R. Nield" -cc: Tom Lane , Bruce Momjian , - Michael Loftis , mlw , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <1024855044.1793.414.camel@localhost.localdomain> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: ORr - -On 23 Jun 2002, J. R. Nield wrote: - -> So since we have all this buffering designed especially to meet our -> needs, and since the OS buffering is in the way, can someone explain to -> me why postgresql would ever open a file without the O_DSYNC flag if the -> platform supports it? - -It's more code, if there are platforms out there that don't support -O_DYSNC. (We still have to keep the old fsync code.) On the other hand, -O_DSYNC could save us a disk arm movement over fsync() because it -appears to me that fsync is also going to force a metadata update, which -means that the inode blocks have to be written as well. - -> Maybe fsync would be slower with two files, but I don't see how -> fdatasync would be, and most platforms support that. - -Because, if both files are on the same disk, you still have to move -the disk arm from the cylinder at the current log file write point -to the cylinder at the current ping-pong file write point. And then back -again to the log file write point cylinder. - -In the end, having a ping-pong file as well seems to me unnecessary -complexity, especially when anyone interested in really good -performance is going to buy a disk subsystem that guarantees no -torn pages and thus will want to turn off the ping-pong file writes -entirely, anyway. - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From jrnield@usol.com Sun Jun 23 14:14:51 2002 -Return-path: -Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5NIEnF01649 - for ; Sun, 23 Jun 2002 14:14:50 -0400 (EDT) -Received: from 04-077.024.popsite.net (04-077.024.popsite.net [216.126.163.77]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5NIEkj19287; - Sun, 23 Jun 2002 14:14:46 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Curt Sampson -cc: Tom Lane , Bruce Momjian , - Michael Loftis , mlw , - PostgreSQL Hacker - -In-Reply-To: -References: -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 23 Jun 2002 14:15:17 -0400 -Message-ID: <1024856120.3054.418.camel@localhost.localdomain> -MIME-Version: 1.0 -Status: OR - -On Sun, 2002-06-23 at 12:10, Curt Sampson wrote: -> -> So what we would really need to implement, if we wanted to be more -> efficient with trusted disk subsystems, would be the option of writing -> to the log only the changed row or changed part of the row, or writing -> the entire changed page. I don't know how hard this would be.... -> -We already log that stuff. The page images are in addition to the -"Logical Changes", so we could just stop logging the page images. - --- -J. R. Nield -jrnield@usol.com - - - - From pgsql-hackers-owner+M24100@postgresql.org Mon Jun 24 13:13:41 2002 Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) @@ -864,7 +108,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO J. R. Nield wrote: > So since we have all this buffering designed especially to meet our @@ -882,301 +126,56 @@ during checkpoint when we discard old WAL files. > > > Why not put all the page images in a single pre-allocated file and treat -> it as a ring? How could this be any worse than flushing them in the WAL -> log? -> -> Maybe fsync would be slower with two files, but I don't see how -> fdatasync would be, and most platforms support that. - -We have fdatasync option for WAL in postgresql.conf. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - - - -From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208 - for ; Mon, 24 Jun 2002 12:54:22 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 2769E4762E3 - for ; Mon, 24 Jun 2002 08:27:39 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id ED459475C61 - for ; Sun, 23 Jun 2002 15:37:08 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642; - Sun, 23 Jun 2002 15:36:54 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us> -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: -To: Curt Sampson -Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT) -cc: "J. R. Nield" , Tom Lane , - Michael Loftis , mlw , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Curt Sampson wrote: -> On 23 Jun 2002, J. R. Nield wrote: -> -> > So since we have all this buffering designed especially to meet our -> > needs, and since the OS buffering is in the way, can someone explain to -> > me why postgresql would ever open a file without the O_DSYNC flag if the -> > platform supports it? -> -> It's more code, if there are platforms out there that don't support -> O_DYSNC. (We still have to keep the old fsync code.) On the other hand, -> O_DSYNC could save us a disk arm movement over fsync() because it -> appears to me that fsync is also going to force a metadata update, which -> means that the inode blocks have to be written as well. - -Again, see postgresql.conf: - -#wal_sync_method = fsync # the default varies across platforms: -# # fsync, fdatasync, open_sync, or open_datasync - -> -> > Maybe fsync would be slower with two files, but I don't see how -> > fdatasync would be, and most platforms support that. -> -> Because, if both files are on the same disk, you still have to move -> the disk arm from the cylinder at the current log file write point -> to the cylinder at the current ping-pong file write point. And then back -> again to the log file write point cylinder. -> -> In the end, having a ping-pong file as well seems to me unnecessary -> complexity, especially when anyone interested in really good -> performance is going to buy a disk subsystem that guarantees no -> torn pages and thus will want to turn off the ping-pong file writes -> entirely, anyway. - -Yes, I don't see writing to two files vs. one to be any win, especially -when we need to fsync both of them. What I would really like is to -avoid the double I/O of writing to WAL and to the data file; improving -that would be a huge win. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - - - -From cjs@cynic.net Sun Jun 23 20:09:44 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O09hF00630 - for ; Sun, 23 Jun 2002 20:09:43 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 6F45AF820; Mon, 24 Jun 2002 00:09:38 +0000 (UTC) -Date: Mon, 24 Jun 2002 09:09:30 +0900 (JST) -From: Curt Sampson -To: Bruce Momjian -cc: "J. R. Nield" , Tom Lane , - Michael Loftis , mlw , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Sun, 23 Jun 2002, Bruce Momjian wrote: - -> Yes, I don't see writing to two files vs. one to be any win, especially -> when we need to fsync both of them. What I would really like is to -> avoid the double I/O of writing to WAL and to the data file; improving -> that would be a huge win. - -You mean, the double I/O of writing the block to the WAL and data file? -(We'd still have to write the changed columns or whatever to the WAL, -right?) - -I'd just add an option to turn it off. If you need it, you need it; -there's no way around that except to buy hardware that is really going -to guarantee your writes (which then means you don't need it). - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From jrnield@usol.com Sun Jun 23 21:28:58 2002 -Return-path: -Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5O1SuF06381 - for ; Sun, 23 Jun 2002 21:28:57 -0400 (EDT) -Received: from 01-072.024.popsite.net (01-072.024.popsite.net [216.126.160.72]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5O1Ssj09303; - Sun, 23 Jun 2002 21:28:55 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Bruce Momjian -cc: Curt Sampson , Tom Lane , - Michael Loftis - , mlw , - PostgreSQL Hacker - -In-Reply-To: <200206231936.g5NJasa07642@candle.pha.pa.us> -References: <200206231936.g5NJasa07642@candle.pha.pa.us> -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 23 Jun 2002 21:29:23 -0400 -Message-ID: <1024882167.1793.733.camel@localhost.localdomain> -MIME-Version: 1.0 -Status: ORr - -On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: -> Yes, I don't see writing to two files vs. one to be any win, especially -> when we need to fsync both of them. What I would really like is to -> avoid the double I/O of writing to WAL and to the data file; improving -> that would be a huge win. +> it as a ring? How could this be any worse than flushing them in the WAL +> log? > +> Maybe fsync would be slower with two files, but I don't see how +> fdatasync would be, and most platforms support that. -If is impossible to do what you want. You can not protect against -partial writes without writing pages twice and calling fdatasync between -them while going through a generic filesystem. The best disk array will -not protect you if the operating system does not align block writes to -the structure of the underlying device. Even with raw devices, you need -special support or knowledge of the operating system and/or the disk -device to ensure that each write request will be atomic to the -underlying hardware. - -All other systems rely on the fact that you can recover a damaged file -using the log archive. This means downtime in the rare case, but no data -loss. Until PostgreSQL can do this, then it will not be acceptable for -real critical production use. This is not to knock PostgreSQL, because -it is a very good database system, and clearly the best open-source one. -It even has feature advantages over the commercial systems. But at the -end of the day, unless you have complete understanding of the I/O system -from write(2) through to the disk system, the only sure ways to protect -against partial writes are by "careful writes" (in the WAL log or -elsewhere, writing pages twice), or by requiring (and allowing) users to -do log-replay recovery when a file is corrupted by a partial write. As -long as there is a UPS, and the operating system doesn't crash, then -there still should be no partial writes. - -If we log pages to WAL, they are useless when archived (after a -checkpoint). So either we have a separate "log" for them (the ping-pong -file), or we should at least remove them when archived, which makes log -archiving more complex but is perfectly doable. - -Finally, I would love to hear why we are using the operating system -buffer manager at all. The OS is acting as a secondary buffer manager -for us. Why is that? What flaw in our I/O system does this reveal? I -know that: - ->We sync only WAL, not the other pages, except for the sync() call we do -> during checkpoint when we discard old WAL files. - -But this is probably not a good thing. We should only be writing blocks -when they need to be on disk. We should not be expecting the OS to write -them "sometime later" and avoid blocking (as long) for the write. If we -need that, then our buffer management is wrong and we need to fix it. -The reason we are doing this is because we expect the OS buffer manager -to do asynchronous I/O for us, but then we don't control the order. That -is the reason why we have to call fdatasync(), to create "sequence -points". - -The reason we have performance problems with either D_OSYNC or fdatasync -on the normal relations is because we have no dbflush process. This -causes an unacceptable amount of I/O blocking by other transactions. - -The ORACLE people were not kidding when they said that they could not -certify Linux for production use until it supported O_DSYNC. Can you -explain why that was the case? - -Finally, let me apologize if the above comes across as somewhat -belligerent. I know very well that I can't compete with you guys for -knowledge of the PosgreSQL system. I am still at a loss when I look at -the optimizer and executor modules, and it will take some time before I -can follow discussion of that area. Even then, I doubt my ability to -compare with people like Mr. Lane and Mr. Momjian in experience and -general intelligence, or in the field of database programming and -software development in particular. However, this discussion and a -search of the pgsql-hackers archives reveals this problem to be the KEY -area of PostgreSQL's failing, and general misunderstanding, when -compared to its commercial competitors. - -Sincerely, - - J. R. Nield +We have fdatasync option for WAL in postgresql.conf. -- -J. R. Nield -jrnield@usol.com + Bruce Momjian | http://candle.pha.pa.us + pgman@candle.pha.pa.us | (610) 853-3000 + + If your life is a hard drive, | 830 Blythe Avenue + + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 + +---------------------------(end of broadcast)--------------------------- +TIP 4: Don't 'kill -9' the postmaster -From pgsql-hackers-owner+M24090@postgresql.org Mon Jun 24 12:38:04 2002 -Return-path: + +From pgsql-hackers-owner+M24091@postgresql.org Mon Jun 24 12:54:22 2002 +Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGc3F05962 - for ; Mon, 24 Jun 2002 12:38:03 -0400 (EDT) + by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OGsMF07208 + for ; Mon, 24 Jun 2002 12:54:22 -0400 (EDT) Received: from localhost.localdomain (postgresql.org [64.49.215.8]) by localhost (Postfix) with ESMTP - id 81B9F4768DF; Mon, 24 Jun 2002 10:18:05 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 10:18:05 2002 + id 7DB7947679D; Mon, 24 Jun 2002 09:48:51 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 09:48:51 2002 Received: from postgresql.org (postgresql.org [64.49.215.8]) by postgresql.org (Postfix) with SMTP - id 81F08476473; Mon, 24 Jun 2002 08:55:28 -0400 (EDT) + id 3FD37476491; Mon, 24 Jun 2002 08:55:34 -0400 (EDT) Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id CDDFA475CC3 - for ; Mon, 24 Jun 2002 08:37:44 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:37:44 2002 + by localhost (Postfix) with ESMTP id 2769E4762E3 + for ; Mon, 24 Jun 2002 08:27:39 -0400 (EDT) +Mailbox-Line: From pgman@candle.pha.pa.us Mon Jun 24 08:27:39 2002 Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 5C971475858 - for ; Sun, 23 Jun 2002 22:47:12 -0400 (EDT) + by postgresql.org (Postfix) with ESMTP id ED459475C61 + for ; Sun, 23 Jun 2002 15:37:08 -0400 (EDT) Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5O2ki712992; - Sun, 23 Jun 2002 22:46:44 -0400 (EDT) + by candle.pha.pa.us (8.11.6/8.10.1) id g5NJasa07642; + Sun, 23 Jun 2002 15:36:54 -0400 (EDT) From: Bruce Momjian -Message-ID: <200206240246.g5O2ki712992@candle.pha.pa.us> +Message-ID: <200206231936.g5NJasa07642@candle.pha.pa.us> Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> -To: "J. R. Nield" -Date: Sun, 23 Jun 2002 22:46:44 -0400 (EDT) -cc: Curt Sampson , Tom Lane , +In-Reply-To: +To: Curt Sampson +Date: Sun, 23 Jun 2002 15:36:54 -0400 (EDT) +cc: "J. R. Nield" , Tom Lane , Michael Loftis , mlw , PostgreSQL Hacker X-Mailer: ELM [version 2.4ME+ PL97 (25)] @@ -1188,119 +187,46 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO -J. R. Nield wrote: -> On Sun, 2002-06-23 at 15:36, Bruce Momjian wrote: -> > Yes, I don't see writing to two files vs. one to be any win, especially -> > when we need to fsync both of them. What I would really like is to -> > avoid the double I/O of writing to WAL and to the data file; improving -> > that would be a huge win. -> > +Curt Sampson wrote: +> On 23 Jun 2002, J. R. Nield wrote: > -> If is impossible to do what you want. You can not protect against -> partial writes without writing pages twice and calling fdatasync between -> them while going through a generic filesystem. The best disk array will -> not protect you if the operating system does not align block writes to -> the structure of the underlying device. Even with raw devices, you need -> special support or knowledge of the operating system and/or the disk -> device to ensure that each write request will be atomic to the -> underlying hardware. - -Yes, I suspected it was impossible, but that doesn't mean I want it any -less. ;-) - -> All other systems rely on the fact that you can recover a damaged file -> using the log archive. This means downtime in the rare case, but no data -> loss. Until PostgreSQL can do this, then it will not be acceptable for -> real critical production use. This is not to knock PostgreSQL, because -> it is a very good database system, and clearly the best open-source one. -> It even has feature advantages over the commercial systems. But at the -> end of the day, unless you have complete understanding of the I/O system -> from write(2) through to the disk system, the only sure ways to protect -> against partial writes are by "careful writes" (in the WAL log or -> elsewhere, writing pages twice), or by requiring (and allowing) users to -> do log-replay recovery when a file is corrupted by a partial write. As -> long as there is a UPS, and the operating system doesn't crash, then -> there still should be no partial writes. - -You are talking point-in-time recovery, a major missing feature right -next to replication, and I agree it makes PostgreSQL unacceptable for -some applications. Point taken. - -And the interesting thing you are saying is that with point-in-time -recovery, we don't need to write pre-write images of pages because if we -detect a partial page write, we then abort the database and tell the -user to do a point-in-time recovery, basically meaning we are using the -previous full backup as our pre-write page image and roll forward using -the logical logs. This is clearly a nice thing to be able to do because -it let's you take a pre-write image of the page once during full backup, -keep it offline, and bring it back in the rare case of a full page write -failure. I now can see how the MSSQL tearoff-bits would be used, not -for recovery, but to detect a partial write and force a point-in-time -recovery from the administrator. - +> > So since we have all this buffering designed especially to meet our +> > needs, and since the OS buffering is in the way, can someone explain to +> > me why postgresql would ever open a file without the O_DSYNC flag if the +> > platform supports it? +> +> It's more code, if there are platforms out there that don't support +> O_DYSNC. (We still have to keep the old fsync code.) On the other hand, +> O_DSYNC could save us a disk arm movement over fsync() because it +> appears to me that fsync is also going to force a metadata update, which +> means that the inode blocks have to be written as well. -> If we log pages to WAL, they are useless when archived (after a -> checkpoint). So either we have a separate "log" for them (the ping-pong -> file), or we should at least remove them when archived, which makes log -> archiving more complex but is perfectly doable. +Again, see postgresql.conf: -Yes, that is how we will do point-in-time recovery; remove the -pre-write page images and archive the rest. It is more complex, but -having the fsync all in one file is too big a win. +#wal_sync_method = fsync # the default varies across platforms: +# # fsync, fdatasync, open_sync, or open_datasync -> Finally, I would love to hear why we are using the operating system -> buffer manager at all. The OS is acting as a secondary buffer manager -> for us. Why is that? What flaw in our I/O system does this reveal? I -> know that: > -> >We sync only WAL, not the other pages, except for the sync() call we do -> > during checkpoint when we discard old WAL files. +> > Maybe fsync would be slower with two files, but I don't see how +> > fdatasync would be, and most platforms support that. +> +> Because, if both files are on the same disk, you still have to move +> the disk arm from the cylinder at the current log file write point +> to the cylinder at the current ping-pong file write point. And then back +> again to the log file write point cylinder. > -> But this is probably not a good thing. We should only be writing blocks -> when they need to be on disk. We should not be expecting the OS to write -> them "sometime later" and avoid blocking (as long) for the write. If we -> need that, then our buffer management is wrong and we need to fix it. -> The reason we are doing this is because we expect the OS buffer manager -> to do asynchronous I/O for us, but then we don't control the order. That -> is the reason why we have to call fdatasync(), to create "sequence -> points". - -Yes. I think I understand. It is true we have to fsync WAL because we -can't control the individual writes by the OS. - -> The reason we have performance problems with either D_OSYNC or fdatasync -> on the normal relations is because we have no dbflush process. This -> causes an unacceptable amount of I/O blocking by other transactions. - -Uh, that would force writes all over the disk. Why do we really care how -the OS writes them? If we are going to fsync, let's just do the one -file and be done with it. What would a separate flusher process really -buy us if it has to use fsync too. The main backend doesn't have to -wait for the fsync, but then again, we can't say the transaction is -committed until it hits the disk, so how does a flusher help? - -> The ORACLE people were not kidding when they said that they could not -> certify Linux for production use until it supported O_DSYNC. Can you -> explain why that was the case? - -I don't see O_DSYNC as very different from write/fsync(or fdatasync). - -> Finally, let me apologize if the above comes across as somewhat -> belligerent. I know very well that I can't compete with you guys for -> knowledge of the PostgreSQL system. I am still at a loss when I look at -> the optimizer and executor modules, and it will take some time before I -> can follow discussion of that area. Even then, I doubt my ability to -> compare with people like Mr. Lane and Mr. Momjian in experience and -> general intelligence, or in the field of database programming and -> software development in particular. However, this discussion and a -> search of the pgsql-hackers archives reveals this problem to be the KEY -> area of PostgreSQL's failing, and general misunderstanding, when -> compared to its commercial competitors. +> In the end, having a ping-pong file as well seems to me unnecessary +> complexity, especially when anyone interested in really good +> performance is going to buy a disk subsystem that guarantees no +> torn pages and thus will want to turn off the ping-pong file writes +> entirely, anyway. -We appreciate your ideas. Few of us are professional db folks so we are -always looking for good ideas. +Yes, I don't see writing to two files vs. one to be any win, especially +when we need to fsync both of them. What I would really like is to +avoid the double I/O of writing to WAL and to the data file; improving +that would be a huge win. -- Bruce Momjian | http://candle.pha.pa.us @@ -1311,9 +237,7 @@ always looking for good ideas. ---------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org +TIP 4: Don't 'kill -9' the postmaster @@ -1336,7 +260,7 @@ In-Reply-To: <1024882167.1793.733.camel@localhost.localdomain> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR +Status: RO On 23 Jun 2002, J. R. Nield wrote: @@ -1556,7 +480,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-5.3 required=5.0 tests=IN_REP_TO,X_NOT_PRESENT version=2.30 -Status: OR +Status: RO > On Sun, 23 Jun 2002, Bruce Momjian wrote: >> Yes, I don't see writing to two files vs. one to be any win, especially @@ -1636,7 +560,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO Tom Lane wrote: > > On Sun, 23 Jun 2002, Bruce Momjian wrote: @@ -1679,417 +603,6 @@ TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org -From pgsql-hackers-owner+M24114@postgresql.org Mon Jun 24 17:54:35 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLsZF28642 - for ; Mon, 24 Jun 2002 17:54:35 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id BD68F47683C; Mon, 24 Jun 2002 16:46:24 -0400 (EDT) -Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 16:46:24 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id B2719476B31; Mon, 24 Jun 2002 16:01:51 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 950004770BC - for ; Mon, 24 Jun 2002 14:59:46 -0400 (EDT) -Mailbox-Line: From tgl@sss.pgh.pa.us Mon Jun 24 14:59:46 2002 -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id A0756475BB7 - for ; Mon, 24 Jun 2002 13:11:41 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OHB1119826; - Mon, 24 Jun 2002 13:11:02 -0400 (EDT) -To: Bruce Momjian -cc: Curt Sampson , "J. R. Nield" , - Michael Loftis , mlw , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <200206241640.g5OGeVY06116@candle.pha.pa.us> -References: <200206241640.g5OGeVY06116@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Mon, 24 Jun 2002 12:40:31 -0400" -Date: Mon, 24 Jun 2002 13:11:01 -0400 -Message-ID: <19823.1024938661@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-5.3 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT - version=2.30 -Status: OR - -Bruce Momjian writes: ->> The only thing I've been able to think of that seems like it might ->> improve matters is to make the WAL writing logic aware of the layout ->> of buffer pages --- specifically, to know that our pages generally ->> contain an uninteresting "hole" in the middle, and not write the hole. ->> Optimistically this might reduce the WAL data volume by something ->> approaching 50%; though pessimistically (if most pages are near full) ->> it wouldn't help much. - -> Good idea. How about putting the page through or TOAST compression -> routine before writing it to WAL? Should be pretty easy and fast and -> doesn't require any knowledge of the page format. - -Easy, maybe, but fast definitely NOT. The compressor is not speedy. -Given that we have to be holding various locks while we build WAL -records, I do not think it's a good idea to add CPU time there. - -Also, compressing already-compressed data is not a win ... - - regards, tom lane - - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - - - -From jrnield@usol.com Mon Jun 24 16:49:25 2002 -Return-path: -Received: from hades.usol.com (IDENT:root@hades.usol.com [208.232.58.41]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OKnNF23393 - for ; Mon, 24 Jun 2002 16:49:24 -0400 (EDT) -Received: from 08-113.024.popsite.net (08-113.024.popsite.net [66.19.4.113]) - by hades.usol.com (8.11.6/8.11.6) with ESMTP id g5OKnHV19100; - Mon, 24 Jun 2002 16:49:18 -0400 -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -From: "J. R. Nield" -To: Curt Sampson -cc: Bruce Momjian , Tom Lane , - PostgreSQL Hacker -In-Reply-To: -References: -Content-Type: text/plain -Content-Transfer-Encoding: 7bit -X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) -Date: 24 Jun 2002 16:49:42 -0400 -Message-ID: <1024951786.1793.865.camel@localhost.localdomain> -MIME-Version: 1.0 -Status: ORr - -On Sun, 2002-06-23 at 23:40, Curt Sampson wrote: -> On 23 Jun 2002, J. R. Nield wrote: -> -> > If is impossible to do what you want. You can not protect against -> > partial writes without writing pages twice and calling fdatasync -> > between them while going through a generic filesystem. -> -> I agree with this. -> -> > The best disk array will not protect you if the operating system does -> > not align block writes to the structure of the underlying device. -> -> This I don't quite understand. Assuming you're using a SCSI drive -> (and this mostly applies to ATAPI/IDE, too), you can do naught but -> align block writes to the structure of the underlying device. When you -> initiate a SCSI WRITE command, you start by telling the device at which -> block to start writing and how many blocks you intend to write. Then you -> start passing the data. -> - -All I'm saying is that the entire postgresql block write must be -converted into exactly one SCSI write command in all cases, and I don't -know a portable way to ensure this. - -> > Even with raw devices, you need special support or knowledge of the -> > operating system and/or the disk device to ensure that each write -> > request will be atomic to the underlying hardware. -> -> Well, so here I guess you're talking about two things: -> -> 1. When you request, say, an 8K block write, will the OS really -> write it to disk in a single 8K or multiple of 8K SCSI write -> command? -> -> 2. Does the SCSI device you're writing to consider these writes to -> be transactional. That is, if the write is interrupted before being -> completed, does the SCSI device guarantee that the partially-sent -> data is not written, and the old data is maintained? And of course, -> does it guarantee that, when it acknowledges a write, that write is -> now in stable storage and will never go away? -> -> Both of these are not hard to guarantee, actually. For a BSD-based OS, -> for example, just make sure that your filesystem block size is the -> same as or a multiple of the database block size. BSD will never write -> anything other than a block or a sequence of blocks to a disk in a -> single SCSI transaction (unless you've got a really odd SCSI driver). -> And for your disk, buy a Baydel or Clarion disk array, or something -> similar. -> -> Given that it's not hard to set up a system that meets these criteria, -> and this is in fact commonly done for database servers, it would seem a -> good idea for postgres to have the option to take advantage of the time -> and money spent and adjust its performance upward appropriately. - -I agree with this. My point was only that you need to know what -guarantees your operating system/hardware combination provides on a -case-by-case basis, and there is no standard way for a program to -discover this. Most system administrators are not going to know this -either, unless databases are their main responsibility. - -> -> > All other systems rely on the fact that you can recover a damaged file -> > using the log archive. -> -> Not exactly. For MS SQL Server, at any rate, if it detects a page tear -> you cannot restore based on the log file alone. You need a full or -> partial backup that includes that entire torn block. -> - -I should have been more specific: you need a backup of the file from -some time ago, plus all the archived logs from then until the current -log sequence number. - -> > This means downtime in the rare case, but no data loss. Until -> > PostgreSQL can do this, then it will not be acceptable for real -> > critical production use. -> -> It seems to me that it is doing this right now. In fact, it's more -> reliable than some commerial systems (such as SQL Server) because it can -> recover from a torn block with just the logfile. - -Again, what I meant to say is that the commercial systems can recover -with an old file backup + logs. How old the backup can be depends only -on how much time you are willing to spend playing the logs forward. So -if you do a full backup once a week, and multiplex and backup the logs, -then even if a backup tape gets destroyed you can still survive. It just -takes longer. - -Also, postgreSQL can't recover from any other type of block corruption, -while the commercial systems can. That's what I meant by the "critical -production use" comment, which was sort-of unfair. - -So I would say they are equally reliable for torn pages (but not bad -blocks), and the commercial systems let you trade potential recovery -time for not having to write the blocks twice. You do need to back-up -the log archives though. - -> -> > But at the end of the day, unless you have complete understanding of -> > the I/O system from write(2) through to the disk system, the only sure -> > ways to protect against partial writes are by "careful writes" (in -> > the WAL log or elsewhere, writing pages twice), or by requiring (and -> > allowing) users to do log-replay recovery when a file is corrupted by -> > a partial write. -> -> I don't understand how, without a copy of the old data that was in the -> torn block, you can restore that block from just log file entries. Can -> you explain this to me? Take, as an example, a block with ten tuples, -> only one of which has been changed "recently." (I.e., only that change -> is in the log files.) -> -> -> > If we log pages to WAL, they are useless when archived (after a -> > checkpoint). So either we have a separate "log" for them (the -> > ping-pong file), or we should at least remove them when archived, -> > which makes log archiving more complex but is perfectly doable. -> -> Right. That seems to me a better option, since we've now got only one -> write point on the disk rather than two. - -OK. I agree with this now. - -> -> > Finally, I would love to hear why we are using the operating system -> > buffer manager at all. The OS is acting as a secondary buffer manager -> > for us. Why is that? What flaw in our I/O system does this reveal? -> -> It's acting as a "second-level" buffer manager, yes, but to say it's -> "secondary" may be a bit misleading. On most of the systems I've set -> up, the OS buffer cache is doing the vast majority of the work, and the -> postgres buffering is fairly minimal. -> -> There are some good (and some perhaps not-so-good) reasons to do it this -> way. I'll list them more or less in the order of best to worst: -> -> 1. The OS knows where the blocks physically reside on disk, and -> postgres does not. Therefore it's in the interest of postgresql to -> dispatch write responsibility back to the OS as quickly as possible -> so that the OS can prioritize requests appropriately. Most operating -> systems use an "elevator" algorithm to minimize disk head movement; -> but if the OS does not have a block that it could write while the -> head is "on the way" to another request, it can't write it in that -> head pass. -> -> 2. Postgres does not know about any "bank-switching" tricks for -> mapping more physical memory than it has address space. Thus, on -> 32-bit machines, postgres might be limited to mapping 2 or 3 GB of -> memory, even though the machine has, say, 6 GB of physical RAM. The -> OS can use all of the available memory for caching; postgres cannot. -> -> 3. A lot of work has been put into the seek algorithms, read-ahead -> algorithms, block allocation algorithms, etc. in the OS. Why -> duplicate all that work again in postgres? -> -> When you say things like the following: -> -> > We should only be writing blocks when they need to be on disk. We -> > should not be expecting the OS to write them "sometime later" and -> > avoid blocking (as long) for the write. If we need that, then our -> > buffer management is wrong and we need to fix it. -> -> you appear to be making the arugment that we should take the route of -> other database systems, and use raw devices and our own management of -> disk block allocation. If so, you might want first to look back through -> the archives at the discussion I and several others had about this a -> month or two ago. After looking in detail at what NetBSD, at least, does -> in terms of its disk I/O algorithms and buffering, I've pretty much come -> around, at least for the moment, to the attitude that we should stick -> with using the OS. I wouldn't mind seeing postgres be able to manage all -> of this stuff, but it's a *lot* of work for not all that much benefit -> that I can see. - -I'll back off on that. I don't know if we want to use the OS buffer -manager, but shouldn't we try to have our buffer manager group writes -together by files, and pro-actively get them out to disk? Right now, it -looks like all our write requests are delayed as long as possible and -the order in which they are written is pretty-much random, as is the -backend that writes the block, so there is no locality of reference even -when the blocks are adjacent on disk, and the write calls are spread-out -over all the backends. - -Would it not be the case that things like read-ahead, grouping writes, -and caching written data are probably best done by PostgreSQL, because -only our buffer manager can understand when they will be useful or when -they will thrash the cache? - -I may likely be wrong on this, and I haven't done any performance -testing. I shouldn't have brought this up alongside the logging issues, -but there seemed to be some question about whether the OS was actually -doing all these things behind the scene. - - -> -> > The ORACLE people were not kidding when they said that they could not -> > certify Linux for production use until it supported O_DSYNC. Can you -> > explain why that was the case? -> -> I'm suspecting it's because Linux at the time had no raw devices, so -> O_DSYNC was the only other possible method of making sure that disk -> writes actually got to disk. -> -> You certainly don't want to use O_DSYNC if you can use another method, -> because O_DSYNC still goes through the the operating system's buffer -> cache, wasting memory and double-caching things. If you're doing your -> own management, you need either to use a raw device or open files with -> the flag that indicates that the buffer cache should not be used at all -> for reads from and writes to that file. - -Would O_DSYNC|O_RSYNC turn off the cache? - -> -> > However, this discussion and a search of the pgsql-hackers archives -> > reveals this problem to be the KEY area of PostgreSQL's failing, and -> > general misunderstanding, when compared to its commercial competitors. -> -> No, I think it's just that you're under a few minor misapprehensions -> here about what postgres and the OS are actually doing. As I said, I -> went through this whole exact argument a month or two ago, on this very -> list, and I came around to the idea that what postgres is doing now -> works quite well, at least on NetBSD. (Most other OSes have disk I/O -> algorithms that are pretty much as good or better.) There might be a -> very slight advantage to doing all one's own I/O management, but it's -> a huge amount of work, and I think that much effort could be much more -> usefully applied to other areas. - -I will look for that discussion in the archives. - -The logging issue is a key one I think. At least I would be very nervous -as a DBA if I were running a system where any damaged file would cause -data loss. - -Does anyone know what the major barriers to infinite log replay are in -PostgreSQL? I'm trying to look for everything that might need to be -changed outside xlog.c, but surely this has come up before. Searching -the archives hasn't revealed much. - - - -As to the I/O issue: - -Since you know a lot about NetBSD internals, I'd be interested in -hearing about what postgresql looks like to the NetBSD buffer manager. -Am I right that strings of successive writes get randomized? What do our -cache-hit percentages look like? I'm going to do some experimenting with -this. - -> -> Just as a side note, I've been a NetBSD developer since about '96, -> and have been delving into the details of OS design since well before -> that time, so I'm coming to this with what I hope is reasonably good -> knowledge of how disks work and how operating systems use them. (Not -> that this should stop you from pointing out holes in my arguments. :-)) -> - -This stuff is very difficult to get right. Glad to know you follow this -list. - - -> cjs -> -- -> Curt Sampson +81 90 7737 2974 http://www.netbsd.org -> Don't you know, in this new Dark Age, we're all light. --XTC -> --- -J. R. Nield -jrnield@usol.com - - - - -From tgl@sss.pgh.pa.us Mon Jun 24 17:16:06 2002 -Return-path: -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5OLG5F25284 - for ; Mon, 24 Jun 2002 17:16:05 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5OLG2121379; - Mon, 24 Jun 2002 17:16:02 -0400 (EDT) -To: "J. R. Nield" -cc: Curt Sampson , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> -References: <1024951786.1793.865.camel@localhost.localdomain> -Comments: In-reply-to "J. R. Nield" - message dated "24 Jun 2002 16:49:42 -0400" -Date: Mon, 24 Jun 2002 17:16:01 -0400 -Message-ID: <21376.1024953361@sss.pgh.pa.us> -From: Tom Lane -Status: OR - -"J. R. Nield" writes: -> Also, postgreSQL can't recover from any other type of block corruption, -> while the commercial systems can. - -Say again? - -> Would it not be the case that things like read-ahead, grouping writes, -> and caching written data are probably best done by PostgreSQL, because -> only our buffer manager can understand when they will be useful or when -> they will thrash the cache? - -I think you have been missing the point. No one denies that there will -be some incremental gain if we do all that. However, the conclusion of -everyone who has thought much about it (and I see Curt has joined that -group) is that the effort would be far out of proportion to the probable -gain. There are a lot of other things we desperately need to spend time -on that would not amount to re-engineering large quantities of OS-level -code. Given that most Unixen have perfectly respectable disk management -subsystems, we prefer to tune our code to make use of that stuff, rather -than follow the "conventional wisdom" that databases need to bypass it. - -Oracle can afford to do that sort of thing because they have umpteen -thousand developers available. Postgres does not. - - regards, tom lane - From pgsql-hackers-owner+M24128@postgresql.org Mon Jun 24 22:01:58 2002 Return-path: Received: from postgresql.org (postgresql.org [64.49.215.8]) @@ -2129,7 +642,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO J. R. Nield wrote: > > This I don't quite understand. Assuming you're using a SCSI drive @@ -2251,7 +764,7 @@ Comments: In-reply-to Bruce Momjian Date: Mon, 24 Jun 2002 17:31:56 -0400 Message-ID: <21482.1024954316@sss.pgh.pa.us> From: Tom Lane -Status: ORr +Status: ROr Bruce Momjian writes: >> Does anyone know what the major barriers to infinite log replay are in @@ -2315,7 +828,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-5.3 required=5.0 tests=IN_REP_TO,X_NOT_PRESENT version=2.30 -Status: OR +Status: RO Bruce Momjian writes: >> Does anyone know what the major barriers to infinite log replay are in @@ -2387,7 +900,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO Tom Lane wrote: > Bruce Momjian writes: @@ -2475,7 +988,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO On Mon, 2002-06-24 at 17:16, Tom Lane wrote: @@ -2522,7 +1035,7 @@ X-Mailer: Ximian Evolution 1.0.3 (1.0.3-6) Date: 24 Jun 2002 20:28:00 -0400 Message-ID: <1024964884.3031.876.camel@localhost.localdomain> MIME-Version: 1.0 -Status: OR +Status: RO On Mon, 2002-06-24 at 17:16, Tom Lane wrote: @@ -2537,50 +1050,6 @@ jrnield@usol.com -From cjs@cynic.net Mon Jun 24 23:32:23 2002 -Return-path: -Received: from academic.cynic.net ([63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P3WMF28287 - for ; Mon, 24 Jun 2002 23:32:23 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 28AB5F820; Tue, 25 Jun 2002 03:32:08 +0000 (UTC) -Date: Tue, 25 Jun 2002 12:32:05 +0900 (JST) -From: Curt Sampson -To: "J. R. Nield" -cc: Bruce Momjian , Tom Lane , - PostgreSQL Hacker -Subject: Re: [HACKERS] Index Scans become Seq Scans after VACUUM ANALYSE -In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On 24 Jun 2002, J. R. Nield wrote: - -> All I'm saying is that the entire postgresql block write must be -> converted into exactly one SCSI write command in all cases, and I don't -> know a portable way to ensure this. - -No, there's no portable way. All you can do is give the admin who -is able to set things up safely the ability to turn of the now-unneeded -(and expensive) safety-related stuff that postgres does. - -> I agree with this. My point was only that you need to know what -> guarantees your operating system/hardware combination provides on a -> case-by-case basis, and there is no standard way for a program to -> discover this. Most system administrators are not going to know this -> either, unless databases are their main responsibility. - -Certainly this is true of pretty much every database system out there. - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - From cjs@cynic.net Tue Jun 25 01:09:02 2002 Return-path: Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) @@ -2599,7 +1068,7 @@ In-Reply-To: <21376.1024953361@sss.pgh.pa.us> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: ORr +Status: ROr On Mon, 24 Jun 2002, Tom Lane wrote: @@ -2662,7 +1131,7 @@ Sender: pgsql-hackers-owner@postgresql.org X-Spam-Status: No, hits=-3.4 required=5.0 tests=IN_REP_TO version=2.30 -Status: OR +Status: RO Curt Sampson wrote: > On Mon, 24 Jun 2002, Tom Lane wrote: @@ -2735,7 +1204,7 @@ X-Virus-Scanned: by AMaViS new-20020517 Precedence: bulk Sender: pgsql-hackers-owner@postgresql.org X-Virus-Scanned: by AMaViS new-20020517 -Status: ORr +Status: ROr I have been experimenting with empirical tests of file system and device level writes to determine the actual constraints in order to speed up the WAL @@ -2953,7 +1422,7 @@ X-Mailing-List: pgsql-hackers Precedence: bulk Sender: pgsql-hackers-owner@postgresql.org X-Virus-Scanned: by amavisd-new at hub.org -Status: OR +Status: RO > One of the things I was thinking about was whether we could use up those