From: Bruce Momjian Date: Tue, 9 Nov 2004 03:57:16 +0000 (+0000) Subject: Remove mmap TODO.detail, now in TODO. X-Git-Tag: REL8_0_0BETA5~103 X-Git-Url: https://granicus.if.org/sourcecode?a=commitdiff_plain;h=d9ab482abbb18cca6b6cb32dc46700431b208cb6;p=postgresql Remove mmap TODO.detail, now in TODO. --- diff --git a/doc/TODO.detail/mmap b/doc/TODO.detail/mmap deleted file mode 100644 index 58a549ef58..0000000000 --- a/doc/TODO.detail/mmap +++ /dev/null @@ -1,2408 +0,0 @@ -From pgsql-hackers-owner+M5149@postgresql.org Mon Feb 26 03:32:49 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id DAA04497 - for ; Mon, 26 Feb 2001 03:32:48 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f1Q8TSx48319; - Mon, 26 Feb 2001 03:29:28 -0500 (EST) - (envelope-from pgsql-hackers-owner+M5149@postgresql.org) -Received: from store.d.zembu.com (nat.zembu.com [209.128.96.253]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f1Q8LPx47243 - for ; Mon, 26 Feb 2001 03:21:25 -0500 (EST) - (envelope-from ncm@zembu.com) -Received: by store.d.zembu.com (Postfix, from userid 509) - id 58E39A782; Mon, 26 Feb 2001 00:21:25 -0800 (PST) -Date: Mon, 26 Feb 2001 00:21:25 -0800 -To: pgsql-hackers@postgresql.org -Subject: Re: [HACKERS] Re: [PATCHES] A patch for xlog.c -Message-ID: <20010226002125.A2430@store.zembu.com> -Reply-To: pgsql-hackers@postgresql.org -References: <200102260200.VAA17397@candle.pha.pa.us> <22318.983161726@sss.pgh.pa.us> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -In-Reply-To: <22318.983161726@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Sun, Feb 25, 2001 at 11:28:46PM -0500 -From: ncm@zembu.com (Nathan Myers) -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -On Sun, Feb 25, 2001 at 11:28:46PM -0500, Tom Lane 
wrote: -> Bruce Momjian writes: -> > It allows no backing store on disk. - -I.e. it allows you to map memory without an associated inode; the memory -may still be swapped. Of course, there is no problem with mapping an -inode too, so that unrelated processes can join in. Solarix has a flag -to pin the shared pages in RAM so they can't be swapped out. - -> > It is the BSD solution to SysV -> > share memory. Here are all the BSDi flags: -> -> > MAP_ANON Map anonymous memory not associated with any specific -> > file. The file descriptor used for creating MAP_ANON -> > must be -1. The offset parameter is ignored. -> -> Hmm. Now that I read down to the "nonstandard extensions" part of the -> HPUX man page for mmap(), I find -> -> If MAP_ANONYMOUS is set in flags: -> -> o A new memory region is created and initialized to all zeros. -> This memory region can be shared only with descendants of -> the current process. - -This is supported on Linux and BSD, but not on Solarix 7. It's not -necessary; you can just map /dev/zero on SysV systems that don't -have MAP_ANON. - -> While I've said before that I don't think it's really necessary for -> processes that aren't children of the postmaster to access the shared -> memory, I'm not sure that I want to go over to a mechanism that makes it -> *impossible* for that to be done. Especially not if the only motivation -> is to avoid having to configure the kernel's shared memory settings. - -There are enormous advantages to avoiding the need to configure kernel -settings. It makes PG a better citizen. PG is much easier to drop in -and use if you don't need attention from the IT department. - -But I don't know of any reason to avoid mapping an actual inode, -so using mmap doesn't necessarily mean giving up sharing among -unrelated processes. - -> Besides, what makes you think there's not a limit on the size of shmem -> allocatable via mmap()? - -I've never seen any mmap limit documented. 
Since mmap() is how -everybody implements shared libraries, such a limit would be equivalent -to a limit on how much/many shared libraries are used. mmap() with -MAP_ANONYMOUS (or its SysV /dev/zero equivalent) is a common, modern -way to get raw storage for malloc(), so such a limit would be a limit -on malloc() too. - -The mmap architecture comes to us from the Mach microkernel memory -manager, backported into BSD and then copied widely. Since it was -the fundamental mechanism for all memory operations in Mach, arbitrary -limits would make no sense. That it worked so well is the reason it -was copied everywhere else, so adding arbitrary limits while copying -it would be silly. I don't think we'll see any systems like that. - -Nathan Myers -ncm@zembu.com - -From pgsql-hackers-owner+M6138@postgresql.org Mon Mar 19 07:57:59 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id HAA26926 - for ; Mon, 19 Mar 2001 07:57:59 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.1/8.11.1) with SMTP id f2JCug641835; - Mon, 19 Mar 2001 07:56:42 -0500 (EST) - (envelope-from pgsql-hackers-owner+M6138@postgresql.org) -Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) - by mail.postgresql.org (8.11.1/8.11.1) with ESMTP id f2JCt7641684 - for ; Mon, 19 Mar 2001 07:55:07 -0500 (EST) - (envelope-from bright@fw.wintelcom.net) -Received: (from bright@localhost) - by fw.wintelcom.net (8.10.0/8.10.0) id f2JCt2325289; - Mon, 19 Mar 2001 04:55:02 -0800 (PST) -Date: Mon, 19 Mar 2001 04:55:01 -0800 -From: Alfred Perlstein -To: Rod Taylor -Cc: Hackers List -Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap() -Message-ID: <20010319045500.T29888@fw.wintelcom.net> -References: <018301c0b070$16049a40$2205010a@jester> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline 
-User-Agent: Mutt/1.2.5i -In-Reply-To: <018301c0b070$16049a40$2205010a@jester>; from rod.taylor@inquent.com on Mon, Mar 19, 2001 at 07:28:21AM -0500 -X-all-your-base: are belong to us. -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - -WOOT WOOT! DANGER WILL ROBINSON! - -> ----- Original Message ----- -> From: "Christian Weisgerber" -> Newsgroups: list.vorbis.dev -> To: -> Sent: Saturday, March 17, 2001 12:01 PM -> Subject: [vorbis-dev] ogg123: shared memory by mmap() -> -> -> > The patch below adds: -> > -> > - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing -> pages -> > through mmap() works. This is taken from Joerg Schilling's star. -> > - configure.in: A_FUNC_SMMAP -> > - ogg123/buffer.c: If we have a working mmap(), use it to create -> > a region of shared memory instead of using System V IPC. -> > -> > Works on BSD. Should also work on SVR4 and offspring (Solaris), -> > and Linux. - -This is a really bad idea performance wise. Solaris has a special -code path for SYSV shared memory that doesn't require tons of swap -tracking structures per-page/per-process. FreeBSD also has this -optimization (it's off by default, but should work since FreeBSD -4.2 via the sysctl kern.ipc.shm_use_phys=1) - -Both OS's use a trick of making the pages non-pageable, this allows -signifigant savings in kernel space required for each attached -process, as well as the use of large pages which reduce the amount -of TLB faults your processes will incurr. - -Anyhow, if you could make this a runtime option it wouldn't be so -evil, but as a compile time option, it's a really bad idea for -Solaris and FreeBSD. 
- --- --Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M6255@postgresql.org Tue Mar 20 18:46:33 2001 -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by candle.pha.pa.us (8.9.0/8.9.0) with ESMTP id SAA02887 - for ; Tue, 20 Mar 2001 18:46:33 -0500 (EST) -Received: from mail.postgresql.org (webmail.postgresql.org [216.126.85.28]) - by mail.postgresql.org (8.11.3/8.11.1) with SMTP id f2KNjtH22390; - Tue, 20 Mar 2001 18:45:55 -0500 (EST) - (envelope-from pgsql-hackers-owner+M6255@postgresql.org) -Received: from fw.wintelcom.net (ns1.wintelcom.net [209.1.153.20]) - by mail.postgresql.org (8.11.3/8.11.1) with ESMTP id f2KNiFH22033 - for ; Tue, 20 Mar 2001 18:44:15 -0500 (EST) - (envelope-from bright@fw.wintelcom.net) -Received: (from bright@localhost) - by fw.wintelcom.net (8.10.0/8.10.0) id f2KNiAW02417; - Tue, 20 Mar 2001 15:44:10 -0800 (PST) -Date: Tue, 20 Mar 2001 15:44:10 -0800 -From: Alfred Perlstein -To: Bruce Momjian -Cc: Rod Taylor , - Hackers List -Subject: Re: [HACKERS] Fw: [vorbis-dev] ogg123: shared memory by mmap() -Message-ID: <20010320154410.H29888@fw.wintelcom.net> -References: <20010319045500.T29888@fw.wintelcom.net> <200103202210.RAA23981@candle.pha.pa.us> -Mime-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -In-Reply-To: <200103202210.RAA23981@candle.pha.pa.us>; from pgman@candle.pha.pa.us on Tue, Mar 20, 2001 at 05:10:33PM -0500 -X-all-your-base: are belong to us. 
-Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -* Bruce Momjian [010320 14:10] wrote: -> > > > The patch below adds: -> > > > -> > > > - acinclude.m4: A new macro A_FUNC_SMMAP to check that sharing -> > > pages -> > > > through mmap() works. This is taken from Joerg Schilling's star. -> > > > - configure.in: A_FUNC_SMMAP -> > > > - ogg123/buffer.c: If we have a working mmap(), use it to create -> > > > a region of shared memory instead of using System V IPC. -> > > > -> > > > Works on BSD. Should also work on SVR4 and offspring (Solaris), -> > > > and Linux. -> > -> > This is a really bad idea performance wise. Solaris has a special -> > code path for SYSV shared memory that doesn't require tons of swap -> > tracking structures per-page/per-process. FreeBSD also has this -> > optimization (it's off by default, but should work since FreeBSD -> > 4.2 via the sysctl kern.ipc.shm_use_phys=1) -> -> > -> > Both OS's use a trick of making the pages non-pageable, this allows -> > signifigant savings in kernel space required for each attached -> > process, as well as the use of large pages which reduce the amount -> > of TLB faults your processes will incurr. -> -> That is interesting. BSDi has SysV shared memory as non-pagable, and I -> always thought of that as a bug. Seems you are saying that having it -> pagable has a significant performance penalty. Interesting. - -Yes, having it pageable is actually sort of bad. - -It doesn't allow you to do several important optimizations. 
- --- --Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org] - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - -From pgsql-general-owner+M14300@postgresql.org Mon Aug 27 13:07:32 2001 -Return-path: -Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RH7VF04800 - for ; Mon, 27 Aug 2001 13:07:31 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RH7Tq17721; - Mon, 27 Aug 2001 12:07:29 -0500 (CDT) - (envelope-from pgsql-general-owner+M14300@postgresql.org) -Received: from svana.org (svana.org [210.9.66.30]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RFE1f13269 - for ; Mon, 27 Aug 2001 11:14:01 -0400 (EDT) - (envelope-from kleptog@svana.org) -Received: from kleptog by svana.org with local (Exim 3.12 #1 (Debian)) - id 15bO5x-0000Fd-00; Tue, 28 Aug 2001 01:14:33 +1000 -Date: Tue, 28 Aug 2001 01:14:33 +1000 -From: Martijn van Oosterhout -To: Andrew Snow -cc: pgsql-general@postgresql.org -Subject: Re: [GENERAL] raw partition -Message-ID: <20010828011433.E32309@svana.org> -Reply-To: Martijn van Oosterhout -References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5i -In-Reply-To: <000101c12f00$dc5814b0$fa01b5ca@avon>; from andrew@modulus.org on Tue, Aug 28, 2001 at 12:02:08AM +1000 -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -On Tue, Aug 28, 2001 at 12:02:08AM +1000, Andrew Snow wrote: -> -> What I think would be better would be moving postgresql to a system of -> using memory-mapped I/O. instead of the shared buffer cache, files -> would be directly memory-mapped and the OS would do the caching. 
I -> can't see this happening though because of platform dependency, but I -> think it's worth another look soon because many unix platforms support -> mmap(). I think it would improve the performance of disk-intensive -> tasks noticeably. - -Well, this has other problems. Consider tables that are larger than your -system memory. You'd have to continuously map and unmap different sections. -That can have odd side effects (witness mozilla on linux having 15,000 -mapped areas or so...) - -You would still however get the advantage that you wouldn't have to copy the -data from the disk buffers to user space, you simply get the disk buffer -mapped into your address space. - -I think that for commonly used tables that are under 100K in size (most of -the system tables), this is quite a workable idea. If you don't mind keeping -them mapped the whole time. - --- -Martijn van Oosterhout -http://svana.org/kleptog/ -> It would be nice if someone came up with a certification system that -> actually separated those who can barely regurgitate what they crammed over -> the last few weeks from those who command secret ninja networking powers. 
- ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - -From pgsql-general-owner+M14319@postgresql.org Mon Aug 27 16:57:10 2001 -Return-path: -Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f7RKv9F16849 - for ; Mon, 27 Aug 2001 16:57:09 -0400 (EDT) -Received: from postgresql.org.org (webmail.postgresql.org [216.126.85.28]) - by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f7RKv9q31456; - Mon, 27 Aug 2001 15:57:09 -0500 (CDT) - (envelope-from pgsql-general-owner+M14319@postgresql.org) -Received: from sss.pgh.pa.us ([192.204.191.242]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id f7RJrsf55472 - for ; Mon, 27 Aug 2001 15:53:54 -0400 (EDT) - (envelope-from tgl@sss.pgh.pa.us) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id f7RJrGK19431; - Mon, 27 Aug 2001 15:53:16 -0400 (EDT) -To: Martijn van Oosterhout -cc: Andrew Snow , pgsql-general@postgresql.org -Subject: Re: [GENERAL] raw partition -In-Reply-To: <20010828011433.E32309@svana.org> -References: <20010827233815.B32309@svana.org> <000101c12f00$dc5814b0$fa01b5ca@avon> <20010828011433.E32309@svana.org> -Comments: In-reply-to Martijn van Oosterhout - message dated "Tue, 28 Aug 2001 01:14:33 +1000" -Date: Mon, 27 Aug 2001 15:53:15 -0400 -Message-ID: <19428.998941995@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-general-owner@postgresql.org -Status: OR - -Martijn van Oosterhout writes: -> You would still however get the advantage that you wouldn't have to copy the -> data from the disk buffers to user space, you simply get the disk buffer -> mapped into your address space. - -AFAICS this would be the *only* advantage. 
While it's not negligible, -it's quite unclear that it's worth the bookkeeping and portability -headaches of managing lots of mmap'd areas, either. - -Before I take this idea seriously at all, I'd want to see a design that -addresses a couple of critical issues: - -1. Postgres' shared buffers are *shared*, potentially across many -processes. How will you deal with buffers for files that have been -mmap'd by only some of the processes? (Maybe this means that the -whole concept of shared buffers goes away, and each process does its -own buffer management based on its own mmaps. Not sure. That would be -a pretty radical restructuring though, and would completely invalidate -our present approach to page-level locking.) - -2. How do you deal with extending a file? My system's mmap man page -says - If the size of the mapped file changes after the call to mmap(), the - effect of references to portions of the mapped region that correspond - to added or removed portions of the file is unspecified. -This suggests that the only portable way to cope is to issue a separate -mmap for every disk page. Will typical Unix systems perform well with -umpteen thousand small mmap requests? - -3. How do you persuade the other backends to drop their mmaps of a table -you are deleting? - -There are probably other gotchas, but without an understanding of how -to address these, I doubt it's worth looking further ... - - regards, tom lane - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? 
- -http://www.postgresql.org/users-lounge/docs/faq.html - -From pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org Mon Oct 1 05:59:15 2001 -Return-path: -Received: from server1.pgsql.org (server1.pgsql.org [64.39.15.238] (may be forged)) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id f919xF512590 - for ; Mon, 1 Oct 2001 05:59:15 -0400 (EDT) -Received: from postgresql.org (webmail.postgresql.org [216.126.85.28]) - by server1.pgsql.org (8.11.6/8.11.6) with ESMTP id f919xA207817 - for ; Mon, 1 Oct 2001 04:59:10 -0500 (CDT) - (envelope-from pgsql-hackers-owner+M13750=candle.pha.pa.us=pgman@postgresql.org) -Received: from mrsgntmail01.mediaring.com.sg (mserver.mediaring.com.sg [203.208.141.175]) - by postgresql.org (8.11.3/8.11.4) with ESMTP id f919rE320926 - for ; Mon, 1 Oct 2001 05:53:15 -0400 (EDT) - (envelope-from jana-reddy@mediaring.com.sg) -Received: by MRSGNTMAIL01 with Internet Mail Service (5.5.2650.21) - id ; Mon, 1 Oct 2001 18:03:34 +0800 -Received: from mediaring.com.sg (10.1.0.131 [10.1.0.131]) by mrsgntmail01.mediaring.com.sg with SMTP (Microsoft Exchange Internet Mail Service Version 5.5.2650.21) - id PMTCM7SH; Mon, 1 Oct 2001 18:03:25 +0800 -From: Janardhana Reddy -To: Bruce Momjian , Tom Lane -cc: PostgreSQL-development , - janareddy - -Message-ID: <3BB83DF0.8946973@mediaring.com.sg> -Date: Mon, 01 Oct 2001 17:57:04 +0800 -X-Mailer: Mozilla 4.75 [en] (X11; U; Linux 2.4.0 i686) -X-Accept-Language: en -MIME-Version: 1.0 -Subject: Re: [HACKERS] PERFORMANCE IMPROVEMENT by mapping WAL FILES -References: <200109282137.f8SLbpm01890@candle.pha.pa.us> -Content-Type: text/plain; charset=us-ascii -Content-Transfer-Encoding: 7bit -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: ORr - - I have just completed the functional testing the WAL using mmap , it is - - working fine, I have tested by commenting out the "CreateCheckPoint " -functionality so that - when i kill the postgres and restart it will redo all the records from 
the -WAL log file which - is updated using mmap. - Just i need to clean code and to do some stress testing. - By the end of this week i should able to complete the stress test and -generate the patch file . - As Tom Lane mentioned i see the problem in portability to all platforms, - - what i propose is to use mmap for only WAL for some platforms like - linux,freebsd etc . For other platforms we can use the existing method by -slightly modifying the - write() routine to write only the modified part of the page. - -Regards -jana - -> -> -> OK, I have talked to Tom Lane about this on the phone and we have a few -> ideas. -> -> Historically, we have avoided mmap() because of portability problems, -> and because using mmap() to write to large tables could consume lots of -> address space with little benefit. However, I perhaps can see WAL as -> being a good use of mmap. -> -> First, there is the issue of using mmap(). For OS's that have the -> mmap() MAP_SHARED flag, different backends could mmap the same file and -> each see the changes. However, keep in mind we still have to fsync() -> WAL, so we need to use msync(). -> -> So, looking at the benefits of using mmap(), we have overhead of -> different backends having to mmap something that now sits quite easily -> in shared memory. Now, I can see mmap reducing the copy from user to -> kernel, but there are other ways to fix that. We could modify the -> write() routines to write() 8k on first WAL page write and later write -> only the modified part of the page to the kernel buffers. The old -> kernel buffer is probably still around so it is unlikely to require a -> read from the file system to read in the rest of the page. This reduces -> the write from 8k to something probably less than 4k which is better -> than we can do with mmap. -> -> I will add a TODO item to this effect. 
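The partial-write idea in the quoted paragraph above (write the full 8k page once, then push only the modified bytes on later writes) could look roughly like the sketch below. It is only an illustration under stated assumptions: the names write_page_range, partial_write_demo and PAGE_SIZE_BYTES are hypothetical and do not come from xlog.c, and the scratch file name is made up for the demo.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES 8192    /* the 8k page size mentioned in the message */

/* Hypothetical helper: write bytes [from, to) of an in-memory page to the
 * matching offsets of the page that starts at page_start in the file. */
int write_page_range(int fd, off_t page_start, const char *page,
                     size_t from, size_t to)
{
    size_t len;

    if (from > to || to > PAGE_SIZE_BYTES)
        return -1;
    len = to - from;
    if (lseek(fd, page_start + (off_t) from, SEEK_SET) < 0)
        return -1;
    return write(fd, page + from, len) == (ssize_t) len ? 0 : -1;
}

/* Demo: one full-page write, then a 6-byte update. Returns 1 if the file
 * ends up identical to the in-memory page, 0 if not, -1 on setup failure. */
int partial_write_demo(void)
{
    static char page[PAGE_SIZE_BYTES], back[PAGE_SIZE_BYTES];
    const char *path = "wal_page_demo.tmp";     /* assumed scratch file */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int ok;

    if (fd < 0)
        return -1;
    memset(page, 'a', sizeof page);
    /* First write of the page: all 8k goes to the kernel. */
    ok = write_page_range(fd, 0, page, 0, sizeof page) == 0;
    /* Later write: only the modified part is handed to write(). */
    memcpy(page + 100, "record", 6);
    ok = ok && write_page_range(fd, 0, page, 100, 106) == 0;
    /* Read the page back and compare it with the in-memory copy. */
    ok = ok && lseek(fd, 0, SEEK_SET) == 0;
    ok = ok && read(fd, back, sizeof back) == (ssize_t) sizeof back;
    ok = ok && memcmp(page, back, sizeof page) == 0;
    close(fd);
    unlink(path);
    return ok;
}
```

Whether this saves anything in practice depends on the kernel still holding the old buffer, as the message itself notes; the sketch only shows the bookkeeping, not the fsync() that WAL still needs.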
-> -> As far as reducing the write to disk from 8k to 4k, if we have to -> fsync/msync, we have to wait for the disk to spin to the proper location -> and at that point writing 4k or 8k doesn't seem like much of a win. -> -> In summary, I think it would be nice to reduce the 8k transfer from user -> to kernel on secondary page writes to only the modified part of the -> page. I am uncertain if mmap() or anything else will help the physical -> write to the disk. -> -> -- -> Bruce Momjian | http://candle.pha.pa.us -> pgman@candle.pha.pa.us | (610) 853-3000 -> + If your life is a hard drive, | 830 Blythe Avenue -> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - -From pgsql-hackers-owner+M23388@postgresql.org Mon Jun 3 17:54:43 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g53LsgB05125 - for ; Mon, 3 Jun 2002 17:54:42 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 15421475884; Mon, 3 Jun 2002 17:54:14 -0400 (EDT) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 8B89B4761F0; Mon, 3 Jun 2002 17:53:49 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id D0F90475ECD - for ; Mon, 3 Jun 2002 17:53:38 -0400 (EDT) -Received: from motgate3.mot.com (motgate3.mot.com [144.189.100.103]) - by postgresql.org (Postfix) with ESMTP id 5CE5147593B - for ; Mon, 3 Jun 2002 17:53:13 -0400 (EDT) -Received: [from pobox.mot.com (pobox.mot.com [129.188.137.100]) by motgate3.mot.com (motgate3 2.1) with ESMTP id OAA22235; Mon, 3 Jun 2002 14:52:44 -0700 (MST)] -Received: [from pronto1.comm.mot.com (pronto1.comm.mot.com [173.6.1.22]) by 
pobox.mot.com (MOT-pobox 2.0) with ESMTP id OAA19166; Mon, 3 Jun 2002 14:52:59 -0700 (MST)] -Received: from kovalenkoigor (idennt19534 [145.1.195.34]) - by pronto1.comm.mot.com (8.9.3/8.9.3) with SMTP id QAA20419; - Mon, 3 Jun 2002 16:52:57 -0500 (CDT) -Message-ID: <0e0a01c20b49$26e90a00$22c30191@comm.mot.com> -From: "Igor Kovalenko" -To: "Bruce Momjian" -cc: "Tom Lane" , "mlw" , - "Marc G. Fournier" , -References: <200206030047.g530lZi21901@candle.pha.pa.us> -Subject: Re: [HACKERS] HEADS UP: Win32/OS2/BeOS native ports -Date: Mon, 3 Jun 2002 16:53:51 -0500 -MIME-Version: 1.0 -Content-Type: text/plain; - charset="iso-8859-1" -Content-Transfer-Encoding: 7bit -X-Priority: 3 -X-MSMail-Priority: Normal -X-Mailer: Microsoft Outlook Express 5.00.2919.6600 -X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6600 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -Status: OR - -That's what Apache does. Note, on most platforms MAP_ANON is equivalent to -mmmap-ing /dev/zero. Solaris for example does not provide MAP_ANON but using - -fd=open(/dev/zero) -mmap(fd, ...) -close(fd) - -works just fine. - ------ Original Message ----- -From: "Bruce Momjian" -To: "Igor Kovalenko" -Cc: "Tom Lane" ; "mlw" ; "Marc G. -Fournier" ; -Sent: Sunday, June 02, 2002 7:47 PM -Subject: Re: [HACKERS] HEADS UP: Win32/OS2/BeOS native ports - - -> Igor Kovalenko wrote: -> > It does not have to be anonymous. POSIX also defines shm_open(same -arguments -> > as open) API which will create named object in whatever location -corresponds -> > to shared memory storage on that platform (object is then grown to -needed -> > size by ftruncate() and the fd is then passed to mmap). The object will -> > exist in name space and can be detected by subsequent calls to -shm_open() -> > with same name. It is not really different from doing open(), but more -> > portable (mmap() on regular files may not be supported). 
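A minimal sketch of the /dev/zero fallback described in this message, assuming a POSIX system with fork(). The helper names map_shared_region and shared_write_demo are hypothetical and are not taken from PostgreSQL or Apache. MAP_ANONYMOUS (or MAP_ANON) is used when the headers define it; otherwise /dev/zero is mapped instead.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Hypothetical helper: map `size` bytes of zero-filled memory that stays
 * shared with children created by fork(). Returns NULL on failure. */
void *map_shared_region(size_t size)
{
    void *p;
#ifdef MAP_ANONYMOUS
    /* No backing file; the descriptor must be -1. */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
#else
    /* Fallback for systems without MAP_ANON: map /dev/zero. */
    int fd = open("/dev/zero", O_RDWR);

    if (fd < 0)
        return NULL;
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping outlives the descriptor */
#endif
    return p == MAP_FAILED ? NULL : p;
}

/* Demo: the child stores 42 in the shared cell and the parent reads it
 * after the child exits. Returns the value seen by the parent, or -1. */
int shared_write_demo(void)
{
    int *cell = map_shared_region(sizeof *cell);
    pid_t pid;
    int status, seen;

    if (cell == NULL)
        return -1;
    *cell = 0;
    pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {             /* child: write through the shared mapping */
        *cell = 42;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

A region created this way is reachable only by descendants of the process that mapped it, which is the limitation raised earlier in this thread; the shm_open() route keeps a name that unrelated processes can open.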
-> -> Actually, I think the best shared memory implemention would be -> MAP_ANON | MAP_SHARED mmap(), which could be called from the postmaster -> and passed to child processes. -> -> While all our platforms have mmap(), many don't have MAP_ANON, but those -> that do could use it. You need MAP_ANON to prevent the shared memory -> from being written to a disk file. -> -> -- -> Bruce Momjian | http://candle.pha.pa.us -> pgman@candle.pha.pa.us | (610) 853-3000 -> + If your life is a hard drive, | 830 Blythe Avenue -> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 -> - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - -From pgsql-hackers-owner+M24146@postgresql.org Tue Jun 25 02:27:29 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5P6RSF12626 - for ; Tue, 25 Jun 2002 02:27:28 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 2C72F475EF6; Tue, 25 Jun 2002 02:27:28 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Tue Jun 25 02:27:28 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 42AAB475B26; Tue, 25 Jun 2002 02:07:04 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id A8D13475A06 - for ; Tue, 25 Jun 2002 02:07:01 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Tue Jun 25 02:07:01 2002 -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by postgresql.org (Postfix) with ESMTP id F3C264760A1 - for ; Tue, 25 Jun 2002 01:05:49 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 
5F61CF820; Tue, 25 Jun 2002 05:05:47 +0000 (UTC) -Date: Tue, 25 Jun 2002 14:05:45 +0900 (JST) -From: Curt Sampson -To: "J. R. Nield" -cc: Bruce Momjian , Tom Lane , - PostgreSQL Hacker -Subject: [HACKERS] Buffer Management -In-Reply-To: <1024951786.1793.865.camel@localhost.localdomain> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-5.3 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT - version=2.30 -Status: OR - -I'm splitting off this buffer mangement stuff into a separate thread. - -On 24 Jun 2002, J. R. Nield wrote: - -> I'll back off on that. I don't know if we want to use the OS buffer -> manager, but shouldn't we try to have our buffer manager group writes -> together by files, and pro-actively get them out to disk? - -The only way the postgres buffer manager can "get [data] out to disk" -is to do an fsync(). For data files (as opposed to log files), this can -only slow down overall system throughput, as this would only disrupt the -OS's write management. - -> Right now, it -> looks like all our write requests are delayed as long as possible and -> the order in which they are written is pretty-much random, as is the -> backend that writes the block, so there is no locality of reference even -> when the blocks are adjacent on disk, and the write calls are spread-out -> over all the backends. - -It doesn't matter. The OS will introduce locality of reference with its -write algorithms. Take a look at - - http://www.cs.wisc.edu/~solomon/cs537/disksched.html - -for an example. Most OSes use the elevator or one-way elevator -algorithm. So it doesn't matter whether it's one back-end or many -writing, and it doesn't matter in what order they do the write. 
- -> Would it not be the case that things like read-ahead, grouping writes, -> and caching written data are probably best done by PostgreSQL, because -> only our buffer manager can understand when they will be useful or when -> they will thrash the cache? - -Operating systems these days are not too bad at guessing what -you're doing. Pretty much every OS I've seen will do read-ahead when -it detects you're doing sequential reads, at least in the forward -direction. And Solaris is even smart enough to mark the pages you've -read as "not needed" so that they quickly get flushed from the cache, -rather than blowing out your entire cache if you go through a large -file. - -> Would O_DSYNC|O_RSYNC turn off the cache? - -No. I suppose there's nothing to stop it doing so, in some -implementations, but the interface is not designed for direct I/O. - -> Since you know a lot about NetBSD internals, I'd be interested in -> hearing about what postgresql looks like to the NetBSD buffer manager. - -Well, looks like pretty much any program, or group of programs, -doing a lot of I/O. :-) - -> Am I right that strings of successive writes get randomized? - -No; as I pointed out, they in fact get de-randomized as much as -possible. The more processes you have throwing out requests, the better -the throughput will be in fact. - -> What do our cache-hit percentages look like? I'm going to do some -> experimenting with this. - -Well, that depends on how much memory you have and what your working -set is. :-) - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? 
- -http://archives.postgresql.org - - - -From cjs@cynic.net Tue Jun 25 09:52:23 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PDqKF07478 - for ; Tue, 25 Jun 2002 09:52:22 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id D9242F820; Tue, 25 Jun 2002 13:52:18 +0000 (UTC) -Date: Tue, 25 Jun 2002 22:52:14 +0900 (JST) -From: Curt Sampson -To: "J. R. Nield" -cc: Bruce Momjian , Tom Lane , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - - -So, while we're at it, what's the current state of people's thinking -on using mmap rather than shared memory for data file buffers? I -see some pretty powerful advantages to this approach, and I'm not -(yet :-)) convinced that the disadvantages are as bad as people think. -I think I can address most of the concerns in doc/TODO.detail/mmap. - -Is this worth pursuing a bit? (I.e., should I spend an hour or two -writing up the advantages and thoughts on how to get around the -problems?) Anybody got objections that aren't in doc/TODO.detail/mmap? - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From tgl@sss.pgh.pa.us Tue Jun 25 10:09:07 2002 -Return-path: -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PE96F08922 - for ; Tue, 25 Jun 2002 10:09:06 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PE92107301; - Tue, 25 Jun 2002 10:09:02 -0400 (EDT) -To: Curt Sampson -cc: "J. R. 
Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: -References: -Comments: In-reply-to Curt Sampson - message dated "Tue, 25 Jun 2002 22:52:14 +0900" -Date: Tue, 25 Jun 2002 10:09:02 -0400 -Message-ID: <7298.1025014142@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Curt Sampson writes: -> So, while we're at it, what's the current state of people's thinking -> on using mmap rather than shared memory for data file buffers? - -There seem to be a couple of different threads in doc/TODO.detail/mmap. - -One envisions mmap as a one-for-one replacement for our current use of -SysV shared memory, the main selling point being to get out from under -kernels that don't have SysV support or have it configured too small. -This might be worth doing, and I think it'd be relatively easy to do -now that the shared memory support is isolated in one file and there's -provisions for selecting a shmem implementation at configure time. -The only thing you'd really have to think about is how to replace the -current behavior that uses shmem attach counts to discover whether any -old backends are left over from a previous crashed postmaster. I dunno -if mmap offers any comparable facility. - -The other discussion seemed to be considering how to mmap individual -data files right into backends' address space. I do not believe this -can possibly work, because of loss of control over visibility of data -changes to other backends, timing of write-backs, etc. - -But as long as you stay away from interpretation #2 and go with -mmap-as-a-shmget-substitute, it might be worthwhile. - -(Hey Marc, can one do mmap in a BSD jail?) 
- - regards, tom lane - -From pgsql-hackers-owner+M24158@postgresql.org Tue Jun 25 10:20:42 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEKgF10228 - for ; Tue, 25 Jun 2002 10:20:42 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 7259547609E; Tue, 25 Jun 2002 10:20:35 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Tue Jun 25 10:20:35 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 8E79647604C; Tue, 25 Jun 2002 10:20:33 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id C3EB1476002 - for ; Tue, 25 Jun 2002 10:20:30 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Tue Jun 25 10:20:30 2002 -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by postgresql.org (Postfix) with ESMTP id 887F9475B2F - for ; Tue, 25 Jun 2002 10:20:16 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 16CCDF820; Tue, 25 Jun 2002 14:20:19 +0000 (UTC) -Date: Tue, 25 Jun 2002 23:20:15 +0900 (JST) -From: Curt Sampson -To: Tom Lane -cc: "J. R. Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7298.1025014142@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-5.3 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT - version=2.30 -Status: OR - -On Tue, 25 Jun 2002, Tom Lane wrote: - -> The only thing you'd really have to think about is how to replace the -> current behavior that uses shmem attach counts to discover whether any -> old backends are left over from a previous crashed postmaster. 
I dunno -> if mmap offers any comparable facility. - -Sure. Just mmap a file, and it will be persistent. - -> The other discussion seemed to be considering how to mmap individual -> data files right into backends' address space. I do not believe this -> can possibly work, because of loss of control over visibility of data -> changes to other backends, timing of write-backs, etc. - -I don't understand why there would be any loss of visibility of changes. -If two backends mmap the same block of a file, and it's shared, that's -the same block of physical memory that they're accessing. Changes don't -even need to "propagate," because the memory is truly shared. You'd keep -your locks in the page itself as well, of course. - -Can you describe the problem in more detail? - -> But as long as you stay away from interpretation #2 and go with -> mmap-as-a-shmget-substitute, it might be worthwhile. - -It's #2 that I was really looking at. :-) - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. 
--XTC - - - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - - - -From pgsql-hackers-owner+M24159@postgresql.org Tue Jun 25 10:25:21 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEPKF10831 - for ; Tue, 25 Jun 2002 10:25:20 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id AA2EF475C46; Tue, 25 Jun 2002 10:25:13 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 10:25:13 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 9657447603B; Tue, 25 Jun 2002 10:23:23 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 364D0475FC2 - for ; Tue, 25 Jun 2002 10:23:18 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 10:23:18 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id C063F47594B - for ; Tue, 25 Jun 2002 10:20:35 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5PEKT310222; - Tue, 25 Jun 2002 10:20:29 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206251420.g5PEKT310222@candle.pha.pa.us> -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7298.1025014142@sss.pgh.pa.us> -To: Tom Lane -Date: Tue, 25 Jun 2002 10:20:29 -0400 (EDT) -cc: Curt Sampson , "J. R. 
Nield" , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Tom Lane wrote: -> Curt Sampson writes: -> > So, while we're at it, what's the current state of people's thinking -> > on using mmap rather than shared memory for data file buffers? -> -> There seem to be a couple of different threads in doc/TODO.detail/mmap. -> -> One envisions mmap as a one-for-one replacement for our current use of -> SysV shared memory, the main selling point being to get out from under -> kernels that don't have SysV support or have it configured too small. -> This might be worth doing, and I think it'd be relatively easy to do -> now that the shared memory support is isolated in one file and there's -> provisions for selecting a shmem implementation at configure time. -> The only thing you'd really have to think about is how to replace the -> current behavior that uses shmem attach counts to discover whether any -> old backends are left over from a previous crashed postmaster. I dunno -> if mmap offers any comparable facility. -> -> The other discussion seemed to be considering how to mmap individual -> data files right into backends' address space. I do not believe this -> can possibly work, because of loss of control over visibility of data -> changes to other backends, timing of write-backs, etc. - -Agreed. Also, there was an interesting thread noting that mmap'ing /dev/zero is -the same as anonmap for OS's that don't have anonmap. That should cover -most of them. The only downside I can see is that SysV shared memory is -locked into RAM on some/most OS's while mmap anon probably isn't. -Locking in RAM is good in most cases, bad in others. 
- -This will also work well when we have non-SysV semaphore support, like -Posix semaphores, so we would be able to run with no SysV stuff. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - - - -From pgsql-hackers-owner+M24160@postgresql.org Tue Jun 25 10:27:40 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEReF11147 - for ; Tue, 25 Jun 2002 10:27:40 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id B33CD476047; Tue, 25 Jun 2002 10:27:16 -0400 (EDT) -Mailbox-Line: From lkindness@csl.co.uk Tue Jun 25 10:27:16 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 3091247606D; Tue, 25 Jun 2002 10:23:24 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 6C39D476002 - for ; Tue, 25 Jun 2002 10:23:19 -0400 (EDT) -Mailbox-Line: From lkindness@csl.co.uk Tue Jun 25 10:23:19 2002 -Received: from internet.csl.co.uk (internet.csl.co.uk [194.130.52.3]) - by postgresql.org (Postfix) with ESMTP id AC203475C46 - for ; Tue, 25 Jun 2002 10:20:49 -0400 (EDT) -Received: from euphrates.csl.co.uk (host-194-67.csl.co.uk [194.130.52.67]) - by internet.csl.co.uk (8.12.1/8.12.1) with ESMTP id g5PEKonH023514; - Tue, 25 Jun 2002 15:20:50 +0100 -Received: from kelvin.csl.co.uk by euphrates.csl.co.uk (8.9.3/ConceptI 2.4) - id PAA08847; Tue, 25 Jun 2002 15:20:52 +0100 (BST) -Received: by kelvin.csl.co.uk (8.11.6) id g5PEKoT28846; Tue, 25 Jun 2002 15:20:50 +0100 -From: Lee Kindness -MIME-Version: 1.0 -Content-Type: text/plain; 
charset=us-ascii -Content-Transfer-Encoding: 7bit -Message-ID: <15640.31809.970880.320561@kelvin.csl.co.uk> -Date: Tue, 25 Jun 2002 15:20:49 +0100 -To: Tom Lane -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7298.1025014142@sss.pgh.pa.us> -References: - <7298.1025014142@sss.pgh.pa.us> -X-Mailer: VM 7.00 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid -cc: Lee Kindness , pgsql-hackers@postgresql.org -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Tom Lane writes: - > There seem to be a couple of different threads in - > doc/TODO.detail/mmap. - > [ snip ] - -A place where mmap could be easily used and would offer a good -performance increase is for COPY FROM. - -Lee. - - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - - - -From cjs@cynic.net Tue Jun 25 10:24:49 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEOmF10749 - for ; Tue, 25 Jun 2002 10:24:49 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id F2629F820; Tue, 25 Jun 2002 14:24:47 +0000 (UTC) -Date: Tue, 25 Jun 2002 23:24:44 +0900 (JST) -From: Curt Sampson -To: Bruce Momjian -cc: Tom Lane , "J. R. Nield" , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <200206251420.g5PEKT310222@candle.pha.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Tue, 25 Jun 2002, Bruce Momjian wrote: - -> The only downside I can see is that SysV shared memory is -> locked into RAM on some/most OS's while mmap anon probably isn't. - -It is if you mlock() it. 
:-) - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - -From tgl@sss.pgh.pa.us Tue Jun 25 10:29:53 2002 -Return-path: -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PETpF11341 - for ; Tue, 25 Jun 2002 10:29:52 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PETn107501; - Tue, 25 Jun 2002 10:29:49 -0400 (EDT) -To: Curt Sampson -cc: "J. R. Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: -References: -Comments: In-reply-to Curt Sampson - message dated "Tue, 25 Jun 2002 23:20:15 +0900" -Date: Tue, 25 Jun 2002 10:29:49 -0400 -Message-ID: <7498.1025015389@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Curt Sampson writes: -> On Tue, 25 Jun 2002, Tom Lane wrote: ->> The other discussion seemed to be considering how to mmap individual ->> data files right into backends' address space. I do not believe this ->> can possibly work, because of loss of control over visibility of data ->> changes to other backends, timing of write-backs, etc. - -> I don't understand why there would be any loss of visibility of changes. -> If two backends mmap the same block of a file, and it's shared, that's -> the same block of physical memory that they're accessing. - -Is it? You have a mighty narrow conception of the range of -implementations that's possible for mmap. - -But the main problem is that mmap doesn't let us control when changes to -the memory buffer will get reflected back to disk --- AFAICT, the OS is -free to do the write-back at any instant after you dirty the page, and -that completely breaks the WAL algorithm. (WAL = write AHEAD log; -the log entry describing a change must hit disk before the data page -change itself does.) 
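The write-ahead ordering described above can be sketched with plain write() and fsync() calls. This is a minimal Python illustration for a Unix system, not PostgreSQL's actual code: the names logged_page_write and demo_wal_ordering, the 4-byte page-number record layout, and the 8192-byte page size are all assumptions made for the example.

```python
import os
import tempfile

PAGE_SIZE = 8192  # assumed block size for the example


def logged_page_write(wal_fd, data_fd, page_no, page):
    """Make the log record durable first, then write the data page."""
    # Hypothetical record layout: 4-byte page number followed by the page image.
    record = page_no.to_bytes(4, "big") + page
    os.write(wal_fd, record)
    os.fsync(wal_fd)  # returns only after the kernel reports the log as flushed
    # The data page is handed to the kernel only now. The kernel may flush it
    # at any later time, but not before the fsync above has returned.
    os.pwrite(data_fd, page, page_no * PAGE_SIZE)


def demo_wal_ordering():
    """Run one logged write against temporary files and return both file sizes."""
    with tempfile.TemporaryDirectory() as d:
        wal_fd = os.open(os.path.join(d, "wal"),
                         os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        try:
            logged_page_write(wal_fd, data_fd, 2, b"x" * PAGE_SIZE)
            return os.fstat(wal_fd).st_size, os.fstat(data_fd).st_size
        finally:
            os.close(wal_fd)
            os.close(data_fd)
```

The only point of the sketch is the order of the calls: the data-page write is issued after the log fsync has returned, so it cannot reach disk too early. With a shared mapping of the data file there is no comparable place to put the fsync, because the page may be written back as soon as it is dirtied.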
- - regards, tom lane - -From pgsql-hackers-owner+M24164@postgresql.org Tue Jun 25 10:44:39 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PEicF14506 - for ; Tue, 25 Jun 2002 10:44:38 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id E20F8476322; Tue, 25 Jun 2002 10:44:27 -0400 (EDT) -Mailbox-Line: From tgl@sss.pgh.pa.us Tue Jun 25 10:44:27 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 47B4847609E; Tue, 25 Jun 2002 10:34:29 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 52A5F475E5F - for ; Tue, 25 Jun 2002 10:34:25 -0400 (EDT) -Mailbox-Line: From tgl@sss.pgh.pa.us Tue Jun 25 10:34:25 2002 -Received: from sss.pgh.pa.us (unknown [192.204.191.242]) - by postgresql.org (Postfix) with ESMTP id 458BB476239 - for ; Tue, 25 Jun 2002 10:32:12 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PEWA107527; - Tue, 25 Jun 2002 10:32:10 -0400 (EDT) -To: Bruce Momjian -cc: Curt Sampson , "J. R. Nield" , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <200206251420.g5PEKT310222@candle.pha.pa.us> -References: <200206251420.g5PEKT310222@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Tue, 25 Jun 2002 10:20:29 -0400" -Date: Tue, 25 Jun 2002 10:32:10 -0400 -Message-ID: <7524.1025015530@sss.pgh.pa.us> -From: Tom Lane -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-5.3 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT - version=2.30 -Status: ORr - -Bruce Momjian writes: -> This will also work well when we have non-SysV semaphore support, like -> Posix semaphores, so we would be able to run with no SysV stuff. 
- -You do realize that we can use Posix semaphores today? The Darwin (OS X) -port uses 'em now. That's one reason I am more interested in mmap as -a shmget substitute than I used to be. - - regards, tom lane - - - ----------------------------(end of broadcast)--------------------------- -TIP 5: Have you checked our extensive FAQ? - -http://www.postgresql.org/users-lounge/docs/faq.html - - - -From pgsql-hackers-owner+M24167@postgresql.org Tue Jun 25 11:02:20 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF2JF16153 - for ; Tue, 25 Jun 2002 11:02:20 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 7FB0F47630C; Tue, 25 Jun 2002 11:02:11 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 11:02:11 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id B755E475C22; Tue, 25 Jun 2002 10:59:45 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 7D058476387 - for ; Tue, 25 Jun 2002 10:59:38 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 10:59:38 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 49F8C475DC6 - for ; Tue, 25 Jun 2002 10:56:00 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5PEtst15464; - Tue, 25 Jun 2002 10:55:54 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206251455.g5PEtst15464@candle.pha.pa.us> -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7524.1025015530@sss.pgh.pa.us> -To: Tom Lane -Date: Tue, 25 Jun 2002 10:55:54 -0400 (EDT) -cc: Curt Sampson , "J. R. 
Nield" , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Tom Lane wrote: -> Bruce Momjian writes: -> > This will also work well when we have non-SysV semaphore support, like -> > Posix semaphores, so we would be able to run with no SysV stuff. -> -> You do realize that we can use Posix semaphores today? The Darwin (OS X) -> port uses 'em now. That's one reason I am more interested in mmap as - -No, I didn't realize we had gotten that far. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - - - -From pgsql-hackers-owner+M24168@postgresql.org Tue Jun 25 11:05:13 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF5CF16398 - for ; Tue, 25 Jun 2002 11:05:13 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 30D2847634D; Tue, 25 Jun 2002 11:05:04 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 11:05:04 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id B49B5475EFA; Tue, 25 Jun 2002 10:59:47 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id A0F20475978 - for ; Tue, 25 Jun 2002 10:59:43 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us 
Tue Jun 25 10:59:43 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 8160E4762F0 - for ; Tue, 25 Jun 2002 10:57:03 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5PEuwO15564; - Tue, 25 Jun 2002 10:56:58 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206251456.g5PEuwO15564@candle.pha.pa.us> -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7498.1025015389@sss.pgh.pa.us> -To: Tom Lane -Date: Tue, 25 Jun 2002 10:56:58 -0400 (EDT) -cc: Curt Sampson , "J. R. Nield" , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-2.3 required=5.0 - tests=IN_REP_TO,DOUBLE_CAPSWORD - version=2.30 -Status: OR - -Tom Lane wrote: -> Curt Sampson writes: -> > On Tue, 25 Jun 2002, Tom Lane wrote: -> >> The other discussion seemed to be considering how to mmap individual -> >> data files right into backends' address space. I do not believe this -> >> can possibly work, because of loss of control over visibility of data -> >> changes to other backends, timing of write-backs, etc. -> -> > I don't understand why there would be any loss of visibility of changes. -> > If two backends mmap the same block of a file, and it's shared, that's -> > the same block of physical memory that they're accessing. -> -> Is it? You have a mighty narrow conception of the range of -> implementations that's possible for mmap. -> -> But the main problem is that mmap doesn't let us control when changes to -> the memory buffer will get reflected back to disk --- AFAICT, the OS is -> free to do the write-back at any instant after you dirty the page, and -> that completely breaks the WAL algorithm. 
(WAL = write AHEAD log; -> the log entry describing a change must hit disk before the data page -> change itself does.) - -Can we mmap WAL without problems? Not sure if there is any gain to it -because we just write it and rarely read from it. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 2: you can get off all lists at once with the unregister command - (send "unregister YourEmailAddressHere" to majordomo@postgresql.org) - - - -From tgl@sss.pgh.pa.us Tue Jun 25 11:00:20 2002 -Return-path: -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PF0JF15955 - for ; Tue, 25 Jun 2002 11:00:19 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5PF0J107808; - Tue, 25 Jun 2002 11:00:19 -0400 (EDT) -To: Bruce Momjian -cc: Curt Sampson , "J. R. Nield" , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <200206251456.g5PEuwO15564@candle.pha.pa.us> -References: <200206251456.g5PEuwO15564@candle.pha.pa.us> -Comments: In-reply-to Bruce Momjian - message dated "Tue, 25 Jun 2002 10:56:58 -0400" -Date: Tue, 25 Jun 2002 11:00:19 -0400 -Message-ID: <7805.1025017219@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Bruce Momjian writes: -> Can we mmap WAL without problems? Not sure if there is any gain to it -> because we just write it and rarely read from it. - -Perhaps, but I don't see any point to it. 
- - regards, tom lane - -From pgsql-hackers-owner+M24171@postgresql.org Tue Jun 25 11:14:23 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PFENF17356 - for ; Tue, 25 Jun 2002 11:14:23 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 8EAA3476244; Tue, 25 Jun 2002 11:14:09 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 11:14:09 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id C32024762B0; Tue, 25 Jun 2002 11:10:33 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 1F81C4762A2 - for ; Tue, 25 Jun 2002 11:10:31 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Tue Jun 25 11:10:31 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id CE09D475B33 - for ; Tue, 25 Jun 2002 11:02:10 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5PF25r16113; - Tue, 25 Jun 2002 11:02:05 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206251502.g5PF25r16113@candle.pha.pa.us> -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7805.1025017219@sss.pgh.pa.us> -To: Tom Lane -Date: Tue, 25 Jun 2002 11:02:05 -0400 (EDT) -cc: Curt Sampson , "J. R. Nield" , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Tom Lane wrote: -> Bruce Momjian writes: -> > Can we mmap WAL without problems? Not sure if there is any gain to it -> > because we just write it and rarely read from it. 
-> -> Perhaps, but I don't see any point to it. - -Agreed. I have been poking around google looking for an article I read -months ago saying that mmap of files is slightly faster in low memory -usage situations, but much slower in high memory usage situations -because the kernel doesn't know as much about the file access in mmap as -it does with stdio. I will find it. :-) - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org - - - -From pgsql-hackers-owner+M24179@postgresql.org Tue Jun 25 12:13:40 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5PGDdF22106 - for ; Tue, 25 Jun 2002 12:13:39 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 962BD4762AF; Tue, 25 Jun 2002 12:13:32 -0400 (EDT) -Mailbox-Line: From brad@bradm.net Tue Jun 25 12:13:32 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 06727476181; Tue, 25 Jun 2002 12:13:31 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id AB1CB4760F7 - for ; Tue, 25 Jun 2002 12:13:28 -0400 (EDT) -Mailbox-Line: From brad@bradm.net Tue Jun 25 12:13:28 2002 -Received: from bradm.net (208-59-250-198.c3-0.smr-ubr1.sbo-smr.ma.cable.rcn.com [208.59.250.198]) - by postgresql.org (Postfix) with ESMTP id 594BD476083 - for ; Tue, 25 Jun 2002 12:13:27 -0400 (EDT) -Received: (from brad@localhost) - by bradm.net (8.11.6/8.11.6) id g5PGCjA14829; - Tue, 25 Jun 2002 12:12:45 -0400 -Date: Tue, 25 Jun 2002 12:12:45 -0400 -From: Bradley McLean -To: Tom Lane 
-cc: Mario Weilguni , - Curt Sampson , "J. R. Nield" , - Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -Message-ID: <20020625121245.A14762@nia.bradm.net> -References: <4D618F6493CE064A844A5D496733D667038E68@freedom.icomedias.com> <7703.1025016772@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -User-Agent: Mutt/1.2.5.1i -In-Reply-To: <7703.1025016772@sss.pgh.pa.us>; from tgl@sss.pgh.pa.us on Tue, Jun 25, 2002 at 10:52:52AM -0400 -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-4.2 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT,DOUBLE_CAPSWORD - version=2.30 -Status: OR - -* Tom Lane (tgl@sss.pgh.pa.us) [020625 11:00]: -> -> msync can force not-yet-written changes down to disk. It does not -> prevent the OS from choosing to write changes *before* you invoke msync. -> -> Our problem is that we want to enforce the write ordering "WAL before -> data file". To do that, we write and fsync (or DSYNC, or something) -> a WAL entry before we issue the write() against the data file. We -> don't really care if the kernel delays the data file write beyond that -> point, but we can be certain that the data file write did not occur -> too early. -> -> msync is designed to ensure exactly the opposite constraint: it can -> guarantee that no changes remain unwritten after time T, but it can't -> guarantee that changes aren't written before time T. - -Okay, so instead of looking for constraints from the OS on the data file, -use the constraints on the WAL file. It would work at the cost of a buffer -copy? Er, maybe two: - -mmap the data file and WAL separately. -Copy the data file page to the WAL mmap area. -Modify the page. -msync() the WAL. -Copy the page to the data file mmap area. -msync() or not the data file. - -(This is half baked, just thought I'd see if it stirred further thought). 
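The steps listed above can be sketched as follows. This is a rough Python illustration under stated assumptions, not a worked-out design: it assumes a Unix system, uses the OS page size as a stand-in for a database block, keeps a single-page WAL area, and the names update_page and demo_mmap_wal are hypothetical. Python's mmap.flush() is the wrapper around msync() here.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # stand-in for a database block size


def update_page(wal_map, data_map, page_no, new_bytes):
    """Apply new_bytes to the start of one page, going through the WAL mapping."""
    off = page_no * PAGE
    # Copy the current data file page into the WAL mmap area.
    wal_map[0:PAGE] = data_map[off:off + PAGE]
    # Modify the copy, not the data file mapping.
    wal_map[0:len(new_bytes)] = new_bytes
    # msync() the WAL before the data mapping is touched at all.
    wal_map.flush()
    # Copy the modified page into the data file mmap area.
    data_map[off:off + PAGE] = wal_map[0:PAGE]
    # msync() the data file; per the steps above this last call is optional.
    data_map.flush(off, PAGE)


def demo_mmap_wal():
    """Map two temporary files, update page 1, and read both files back."""
    with tempfile.TemporaryDirectory() as d:
        data_fd = os.open(os.path.join(d, "data"), os.O_RDWR | os.O_CREAT, 0o600)
        wal_fd = os.open(os.path.join(d, "wal"), os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(data_fd, 4 * PAGE)  # files must be sized before mapping
        os.ftruncate(wal_fd, PAGE)
        data_map = mmap.mmap(data_fd, 4 * PAGE)  # MAP_SHARED by default on Unix
        wal_map = mmap.mmap(wal_fd, PAGE)
        try:
            update_page(wal_map, data_map, 1, b"hello")
            return os.pread(data_fd, 5, PAGE), os.pread(wal_fd, 5, 0)
        finally:
            data_map.close()
            wal_map.close()
            os.close(data_fd)
            os.close(wal_fd)
```

Note the two page-sized copies per update, which is the cost the message above anticipates. The sketch also leaves open how other backends would see the change between the WAL flush and the copy into the data mapping.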
- -As another approach, how expensive is re-MMAPing portions of the files -compared to the copies. - --Brad - -> -> regards, tom lane -> -> -> -> ---------------------------(end of broadcast)--------------------------- -> TIP 3: if posting/reading through Usenet, please send an appropriate -> subscribe-nomail command to majordomo@postgresql.org so that your -> message can get through to the mailing list cleanly -> - - - ----------------------------(end of broadcast)--------------------------- -TIP 4: Don't 'kill -9' the postmaster - - - -From cjs@cynic.net Wed Jun 26 00:13:45 2002 -Return-path: -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5Q4Dig27201 - for ; Wed, 26 Jun 2002 00:13:45 -0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id B95E5F820; Wed, 26 Jun 2002 04:13:45 +0000 (UTC) -Date: Wed, 26 Jun 2002 13:13:42 +0900 (JST) -From: Curt Sampson -To: Tom Lane -cc: "J. R. Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <7498.1025015389@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Status: OR - -On Tue, 25 Jun 2002, Tom Lane wrote: - -> Curt Sampson writes: -> -> > I don't understand why there would be any loss of visibility of changes. -> > If two backends mmap the same block of a file, and it's shared, that's -> > the same block of physical memory that they're accessing. -> -> Is it? You have a mighty narrow conception of the range of -> implementations that's possible for mmap. - -It's certainly possible to implement something that you call mmap -that is not. But if you are using the posix-defined MAP_SHARED flag, -the behaviour above is what you see. It might be implemented slightly -differently internally, but that's no concern of postgres. 
And I find -it pretty unlikely that it would be implemented otherwise without good -reason. - -Note that your proposal of using mmap to replace sysv shared memory -relies on the behaviour I've described too. As well, if you're replacing -sysv shared memory with an mmap'd file, you may end up doing excessive -disk I/O on systems without the MAP_NOSYNC option. (Without this option, -the update thread/daemon may ensure that every buffer is flushed to the -backing store on disk every 30 seconds or so. You might be able to get -around this by using a small file-backed area for things that need to -persist after a crash, and a larger anonymous area for things that don't -need to persist after a crash.) - -> But the main problem is that mmap doesn't let us control when changes to -> the memory buffer will get reflected back to disk --- AFAICT, the OS is -> free to do the write-back at any instant after you dirty the page, and -> that completely breaks the WAL algorithm. (WAL = write AHEAD log; -> the log entry describing a change must hit disk before the data page -> change itself does.) - -Hm. Well, we could try not to write the data to the page until -after we receive notification that our WAL data is committed to -stable storage. However, the new data has to be available to all of -the backends at the exact time that the commit happens. Perhaps a -shared list of pending writes? - -Another option would be to just let it write, but on startup, scan -all of the data blocks in the database for tuples that have a -transaction ID later than the last one we updated to, and remove -them. That could be pretty darn expensive on a large database, though. - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. 
--XTC - - -From tgl@sss.pgh.pa.us Wed Jun 26 09:22:05 2002 -Return-path: -Received: from sss.pgh.pa.us (root@[192.204.191.242]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5QDM3g26028 - for ; Wed, 26 Jun 2002 09:22:04 -0400 (EDT) -Received: from sss2.sss.pgh.pa.us (tgl@localhost [127.0.0.1]) - by sss.pgh.pa.us (8.11.4/8.11.4) with ESMTP id g5QDLxv01699; - Wed, 26 Jun 2002 09:21:59 -0400 (EDT) -To: Curt Sampson -cc: "J. R. Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: -References: -Comments: In-reply-to Curt Sampson - message dated "Wed, 26 Jun 2002 13:13:42 +0900" -Date: Wed, 26 Jun 2002 09:21:59 -0400 -Message-ID: <1696.1025097719@sss.pgh.pa.us> -From: Tom Lane -Status: ORr - -Curt Sampson writes: -> Note that your proposal of using mmap to replace sysv shared memory -> relies on the behaviour I've described too. - -True, but I was not envisioning mapping an actual file --- at least -on HPUX, the only way to generate an arbitrary-sized shared memory -region is to use MAP_ANONYMOUS and not have the mmap'd area connected -to any file at all. It's not farfetched to think that this aspect -of mmap might work differently from mapping pieces of actual files. - -In practice of course we'd have to restrict use of any such -implementation to platforms where mmap behaves reasonably ... according -to our definition of "reasonably". 
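As a rough illustration of what a check for "reasonable" mmap behaviour might look like, the Python sketch below writes through one MAP_SHARED mapping of a file and reads the bytes back through a second, independent mapping. It is only a sketch under stated assumptions: the function name shared_mappings_agree is hypothetical, it runs on Unix-like systems only, and it exercises two mappings inside a single process rather than separate backends, so a passing result suggests coherent behaviour but does not prove it for every platform.

```python
import mmap
import os
import tempfile


def shared_mappings_agree(payload: bytes = b"wal-test") -> bool:
    """Write through one MAP_SHARED mapping, read through a second one.

    Hypothetical probe, not PostgreSQL code. It only checks two mappings
    of the same file within one process.
    """
    size = mmap.PAGESIZE
    fd, path = tempfile.mkstemp()
    try:
        # Pre-allocate the backing store by writing zeros so the file
        # has real blocks behind the mapped page.
        os.write(fd, b"\0" * size)
        writer = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ | mmap.PROT_WRITE)
        reader = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                           prot=mmap.PROT_READ)
        try:
            writer[:len(payload)] = payload
            # No msync() or read() in between: the second mapping is
            # expected to see the change if both share one page.
            return reader[:len(payload)] == payload
        finally:
            reader.close()
            writer.close()
    finally:
        os.close(fd)
        os.unlink(path)
```

A configure-style test along these lines could be one input when deciding which platforms to allow, alongside checks that actually involve more than one process.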
- - regards, tom lane - -From pgsql-hackers-owner+M24252@postgresql.org Wed Jun 26 16:14:36 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5QKEag03467 - for ; Wed, 26 Jun 2002 16:14:36 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id B10E9476B4D; Wed, 26 Jun 2002 15:16:32 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Wed Jun 26 15:16:32 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 6635E476DC0; Wed, 26 Jun 2002 14:31:10 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id 13F884765BD - for ; Wed, 26 Jun 2002 14:22:36 -0400 (EDT) -Mailbox-Line: From pgman@candle.pha.pa.us Wed Jun 26 14:22:36 2002 -Received: from candle.pha.pa.us (216-55-132-35.dsl.san-diego.abac.net [216.55.132.35]) - by postgresql.org (Postfix) with ESMTP id 3F02D476EB3 - for ; Wed, 26 Jun 2002 13:11:37 -0400 (EDT) -Received: (from pgman@localhost) - by candle.pha.pa.us (8.11.6/8.10.1) id g5QHBJM15565; - Wed, 26 Jun 2002 13:11:19 -0400 (EDT) -From: Bruce Momjian -Message-ID: <200206261711.g5QHBJM15565@candle.pha.pa.us> -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <1696.1025097719@sss.pgh.pa.us> -To: Tom Lane -Date: Wed, 26 Jun 2002 13:11:19 -0400 (EDT) -cc: Curt Sampson , "J. R. Nield" , - PostgreSQL Hacker -X-Mailer: ELM [version 2.4ME+ PL97 (25)] -MIME-Version: 1.0 -Content-Transfer-Encoding: 7bit -Content-Type: text/plain; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-3.4 required=5.0 - tests=IN_REP_TO - version=2.30 -Status: OR - -Tom Lane wrote: -> Curt Sampson writes: -> > Note that your proposal of using mmap to replace sysv shared memory -> > relies on the behaviour I've described too. 
-> -> True, but I was not envisioning mapping an actual file --- at least -> on HPUX, the only way to generate an arbitrary-sized shared memory -> region is to use MAP_ANONYMOUS and not have the mmap'd area connected -> to any file at all. It's not farfetched to think that this aspect -> of mmap might work differently from mapping pieces of actual files. -> -> In practice of course we'd have to restrict use of any such -> implementation to platforms where mmap behaves reasonably ... according -> to our definition of "reasonably". - -Yes, I am told mapping /dev/zero is the same as the anon map. - --- - Bruce Momjian | http://candle.pha.pa.us - pgman@candle.pha.pa.us | (610) 853-3000 - + If your life is a hard drive, | 830 Blythe Avenue - + Christ can be your backup. | Drexel Hill, Pennsylvania 19026 - - - ----------------------------(end of broadcast)--------------------------- -TIP 6: Have you searched our list archives? - -http://archives.postgresql.org - - - -From pgsql-hackers-owner+M24292@postgresql.org Wed Jun 26 23:39:10 2002 -Return-path: -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id g5R3d9g02161 - for ; Wed, 26 Jun 2002 23:39:09 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP - id 88BF4476287; Wed, 26 Jun 2002 23:38:56 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Wed Jun 26 23:38:56 2002 -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by postgresql.org (Postfix) with SMTP - id 3C069476954; Wed, 26 Jun 2002 23:38:17 -0400 (EDT) -Received: from localhost.localdomain (postgresql.org [64.49.215.8]) - by localhost (Postfix) with ESMTP id A0397476941 - for ; Wed, 26 Jun 2002 23:38:12 -0400 (EDT) -Mailbox-Line: From cjs@cynic.net Wed Jun 26 23:38:12 2002 -Received: from academic.cynic.net (academic.cynic.net [63.144.177.3]) - by postgresql.org (Postfix) with ESMTP id 2AA24475C40 - for ; Wed, 26 Jun 2002 23:37:18 
-0400 (EDT) -Received: from angelic-academic.cvpn.cynic.net (angelic-academic.cvpn.cynic.net [198.73.220.224]) - by academic.cynic.net (Postfix) with ESMTP - id 179D5F822; Thu, 27 Jun 2002 03:37:20 +0000 (UTC) -Date: Thu, 27 Jun 2002 12:37:18 +0900 (JST) -From: Curt Sampson -To: Tom Lane -cc: "J. R. Nield" , Bruce Momjian , - PostgreSQL Hacker -Subject: Re: [HACKERS] Buffer Management -In-Reply-To: <1696.1025097719@sss.pgh.pa.us> -Message-ID: -MIME-Version: 1.0 -Content-Type: TEXT/PLAIN; charset=US-ASCII -Precedence: bulk -Sender: pgsql-hackers-owner@postgresql.org -X-Spam-Status: No, hits=-5.3 required=5.0 - tests=IN_REP_TO,X_NOT_PRESENT - version=2.30 -Status: OR - -On Wed, 26 Jun 2002, Tom Lane wrote: - -> Curt Sampson writes: -> > Note that your proposal of using mmap to replace sysv shared memory -> > relies on the behaviour I've described too. -> -> True, but I was not envisioning mapping an actual file --- at least -> on HPUX, the only way to generate an arbitrary-sized shared memory -> region is to use MAP_ANONYMOUS and not have the mmap'd area connected -> to any file at all. It's not farfetched to think that this aspect -> of mmap might work differently from mapping pieces of actual files. - -I find it somewhat farfetched, for a couple of reasons: - - 1. Memory mapped with the MAP_SHARED flag is shared memory, - anonymous or not. POSIX is pretty explicit about how this works, - and the "standard" for mmap that predates POSIX is the same. - Anonymous memory does not behave differently. - - You could just as well say that some systems might exist such - that one process can write() a block to a file, and then another - might read() it afterwards but not see the changes. Postgres - should not try to deal with hypothetical systems that are so - completely broken. - - 2. Mmap is implemented as part of a unified buffer cache system - on all of today's operating systems that I know of. 
The memory - is backed by swap space when anonymous, and by a specified file - when not anonymous; but the way these two are handled is - *exactly* the same internally. - - Even on older systems without unified buffer cache, the behaviour - is the same between anonymous and file-backed mmap'd memory. - And there would be no point in making it otherwise. Mmap is - designed to let you share memory; why make a broken implementation - under certain circumstances? - -> In practice of course we'd have to restrict use of any such -> implementation to platforms where mmap behaves reasonably ... according -> to our definition of "reasonably". - -Of course. As we do already with regular I/O. - -cjs --- -Curt Sampson +81 90 7737 2974 http://www.netbsd.org - Don't you know, in this new Dark Age, we're all light. --XTC - - - - ----------------------------(end of broadcast)--------------------------- -TIP 3: if posting/reading through Usenet, please send an appropriate -subscribe-nomail command to majordomo@postgresql.org so that your -message can get through to the mailing list cleanly - - - -From pgsql-committers-owner+M9273=maillist=candle.pha.pa.us@postgresql.org Thu Mar 6 19:37:25 2003 -Return-path: -Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h270bM624923 - for ; Thu, 6 Mar 2003 19:37:24 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by relay2.pgsql.com (Postfix) with ESMTP id 4D5CDEE0411 - for ; Thu, 6 Mar 2003 19:37:23 -0500 (EST) -X-Original-To: pgsql-committers@postgresql.org -Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) - by postgresql.org (Postfix) with ESMTP - id 3120E47646F; Thu, 6 Mar 2003 19:36:58 -0500 (EST) -Received: by perrin.int.nxad.com (Postfix, from userid 1001) - id 9CBE42105B; Thu, 6 Mar 2003 16:36:40 -0800 (PST) -Date: Thu, 6 Mar 2003 16:36:40 -0800 -From: Sean Chittenden -To: Tom Lane -cc: Christopher Kings-Lynne , - 
pgsql-committers@postgresql.org, pgsql-performance@postgresql.org -Subject: Re: [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ... -Message-ID: <20030307003640.GF79234@perrin.int.nxad.com> -References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: multipart/signed; micalg=pgp-sha1; - protocol="application/pgp-signature"; boundary="HjNkcEWJ4DMx36DP" -Content-Disposition: inline -In-Reply-To: <15071.1046964336@sss.pgh.pa.us> -User-Agent: Mutt/1.4i -X-PGP-Key: finger seanc@FreeBSD.org -X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 -X-Web-Homepage: http://sean.chittenden.org/ -Precedence: bulk -Sender: pgsql-committers-owner@postgresql.org -Status: OR - ---HjNkcEWJ4DMx36DP -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -Content-Transfer-Encoding: quoted-printable - -[moving to -performance, please drop -committers from replies] - -> > I've toyed with the idea of adding this because it is monstrously more -> > efficient than select()/poll() in basically every way, shape, and -> > form. ->=20 -> From what I've looked at, kqueue only wins when you are watching a -> large number of file descriptors at the same time; which is an -> operation done nowhere in Postgres. I think the above would be a -> complete waste of effort. - -It scales very well to many thousands of descriptors, but it also -works well on small numbers as well. kqueue is about 5x faster than -select() or poll() on the low end of number of fd's. As I said -earlier, I don't think there is _much_ to gain in this regard, but I -do think that it would be a speed improvement but only to one OS -supported by PostgreSQL. 
I think that there are bigger speed -improvements to be had elsewhere in the code. - -> > Is this one of the areas of PostgreSQL that just needs to get -> > slowly migrated to use mmap() or are there any gaping reasons why -> > to not use the family of system calls? -> -> There has been much speculation on this, and no proof that it -> actually buys us anything to justify the portability hit. - -Actually, I think that it wouldn't be that big of a portability hit -because you still would read() and write() as always, but in -performance sensitive areas, an #ifdef HAVE_MMAP section would have -the appropriate mmap() calls. If the system doesn't have mmap(), -there isn't much to lose and we're in the same position we're in now. - -> There would be some nontrivial problems to solve, such as the -> mechanics of accessing a large number of files from a large number -> of backends without running out of virtual memory. Also, is it -> guaranteed that multiple backends mmap'ing the same block will -> access the very same physical buffer, and not multiple copies? -> Multiple copies would be fatal. See the archives for more -> discussion. - -Have read through the archives. Making a call to madvise() will speed -up access to the pages as it gives hints to the VM about what order -the pages are accessed/used. Here are a few bits from the BSD mmap() -and madvise() man pages: - -mmap(2): - MAP_NOSYNC Causes data dirtied via this VM map to be flushed to - physical media only when necessary (usually by the - pager) rather than gratuitously. Typically this prevents - the update daemons from flushing pages dirtied - through such maps and thus allows efficient sharing of - memory across unassociated processes using a file-backed - shared memory map. 
Without this option any VM - pages you dirty may be flushed to disk every so often - (every 30-60 seconds usually) which can create performance - problems if you do not need that to occur (such - as when you are using shared file-backed mmap regions - for IPC purposes). Note that VM/filesystem coherency - is maintained whether you use MAP_NOSYNC or not. This - option is not portable across UNIX platforms (yet), - though some may implement the same behavior by default. - - WARNING! Extending a file with ftruncate(2), thus creating - a big hole, and then filling the hole by modifying - a shared mmap() can lead to severe file fragmentation. - In order to avoid such fragmentation you should - always pre-allocate the file's backing store by - write()ing zeros into the newly extended area prior to - modifying the area via your mmap(). The fragmentation - problem is especially sensitive to MAP_NOSYNC pages, - because pages may be flushed to disk in a totally random - order. - - The same applies when using MAP_NOSYNC to implement a - file-based shared memory store. It is recommended that - you create the backing store by write()ing zeros to - the backing file rather than ftruncate()ing it. You - can test file fragmentation by observing the KB/t - (kilobytes per transfer) results from an ``iostat 1'' - while reading a large file sequentially, e.g. using - ``dd if=filename of=/dev/null bs=32k''. - - The fsync(2) function will flush all dirty data and - metadata associated with a file, including dirty NOSYNC - VM data, to physical media. The sync(8) command and - sync(2) system call generally do not flush dirty NOSYNC - VM data. The msync(2) system call is obsolete since - BSD implements a coherent filesystem buffer cache. - However, it may be used to associate dirty VM pages - with filesystem buffers and thus cause them to be - flushed to physical media sooner rather than later. 
- -madvise(2): - MADV_NORMAL Tells the system to revert to the default paging behavior. - - MADV_RANDOM Is a hint that pages will be accessed randomly, and - prefetching is likely not advantageous. - - MADV_SEQUENTIAL Causes the VM system to depress the priority of pages - immediately preceding a given page when it is faulted - in. - -mprotect(2): - The mprotect() system call changes the specified pages to have protection - prot. Not all implementations will guarantee protection on a page basis; - the granularity of protection changes may be as large as an entire - region. A region is the virtual address space defined by the start and - end addresses of a struct vm_map_entry. - - Currently these protection bits are known, which can be combined, OR'd - together: - - PROT_NONE No permissions at all. - - PROT_READ The pages can be read. - - PROT_WRITE The pages can be written. - - PROT_EXEC The pages can be executed. - -msync(2): - The msync() system call writes any modified pages back to the filesystem - and updates the file modification time. If len is 0, all modified pages - within the region containing addr will be flushed; if len is non-zero, - only those pages containing addr and len-1 succeeding locations will be - examined. The flags argument may be specified as follows: - - MS_ASYNC Return immediately - MS_SYNC Perform synchronous writes - MS_INVALIDATE Invalidate all cached data - - -A few thoughts come to mind: - -1) backends could share buffers by mmap()'ing shared regions of data. - While I haven't seen any numbers to reflect this, I'd wager that - mmap() is a faster interface than ipc. - -2) It looks like while there are various file IO schemes scattered all - over the place, the bulk of the critical routines that would need - to be updated are in backend/storage/file/fd.c, more specifically: - - *) fileNameOpenFile() would need the appropriate mmap() call made - to it. 
- - *) FileTruncate() would need some attention to avoid fragmentation. - - *) a new "sync" GUC would have to be introduced to handle msync - (affects only pg_fsync() and pg_fdatasync()). - -3) There's a bit of code in pgsql/src/backend/storage/smgr that could - be gutted/removed. Which of those storage types are even used any - more? There's a reference in the code to PostgreSQL 3.0. :) - -And I think that'd be it. The LRU code could be used if necessary to -help manage the amount of mmap()'ed data in the VM at any one time; at the -very least that could be handled by a shm var that various backends -would increment/decrement as files are open()'ed/close()'ed. - -I didn't spend too long looking at this, but I _think_ that'd cover -80% of PostgreSQL's disk access needs. The next bit to possibly add -would be passing a flag on FileOpen operations that'd act as a hint to -madvise(), so that the VM could proactively react to PostgreSQL's -needs. - -I don't have my copy of Stevens handy (it's some 700mi away atm -otherwise I'd cite it), but if Tom or someone else has it handy, look -up the example re: the performance gain from read()'ing an mmap()'ed -file versus a non-mmap()'ed file. The difference is non-trivial and -_WELL_ worth the time given the speed increase. The same speed -benefit held true for writes as well, iirc. It's been a while, but I -think it was around page 330. The index has it listed and it's not -that hard of an example to find. 
-sc - ---=20 -Sean Chittenden - ---HjNkcEWJ4DMx36DP -Content-Type: application/pgp-signature -Content-Disposition: inline - ------BEGIN PGP SIGNATURE----- -Comment: Sean Chittenden - -iD8DBQE+Z+mY3ZnjH7yEs0ERAjVkAJwMI1V7+HvMAA5ODadD5znsekI8TQCgvH0C -KwvG7YLsJ+xpsTUS67KD+4M= -=w8/7 ------END PGP SIGNATURE----- - ---HjNkcEWJ4DMx36DP-- - -From pgsql-performance-owner+M1354=pgman=candle.pha.pa.us@postgresql.org Fri Mar 7 01:09:07 2003 -Return-path: -Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h27693604295 - for ; Fri, 7 Mar 2003 01:09:05 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by relay2.pgsql.com (Postfix) with ESMTP id 95CD2EDFD3B - for ; Fri, 7 Mar 2003 01:09:03 -0500 (EST) -X-Original-To: pgsql-performance@postgresql.org -Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) - by postgresql.org (Postfix) with ESMTP id F16034768E2 - for ; Fri, 7 Mar 2003 01:04:33 -0500 (EST) -Received: by perrin.int.nxad.com (Postfix, from userid 1001) - id 7969A21065; Thu, 6 Mar 2003 22:04:12 -0800 (PST) -Date: Thu, 6 Mar 2003 22:04:12 -0800 -From: Sean Chittenden -To: Neil Conway -cc: Tom Lane , - Christopher Kings-Lynne , - PostgreSQL Performance -Subject: Re: [PERFORM] [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ... 
-Message-ID: <20030307060412.GA19138@perrin.int.nxad.com> -References: <20030306031656.1876F4762E0@postgresql.org> <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us> <20030307003640.GF79234@perrin.int.nxad.com> <1046998072.10527.67.camel@tokyo> -MIME-Version: 1.0 -Content-Type: multipart/signed; micalg=pgp-sha1; - protocol="application/pgp-signature"; boundary="KsGdsel6WgEHnImy" -Content-Disposition: inline -In-Reply-To: <1046998072.10527.67.camel@tokyo> -User-Agent: Mutt/1.4i -X-PGP-Key: finger seanc@FreeBSD.org -X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 -X-Web-Homepage: http://sean.chittenden.org/ -Precedence: bulk -Sender: pgsql-performance-owner@postgresql.org -Status: OR - ---KsGdsel6WgEHnImy -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -Content-Transfer-Encoding: quoted-printable - -> > I don't have my copy of Steven's handy (it's some 700mi away atm -> > otherwise I'd cite it), but if Tom or someone else has it handy, look -> > up the example re: the performance gain from read()'ing an mmap()'ed -> > file versus a non-mmap()'ed file. The difference is non-trivial and -> > _WELL_ worth the time given the speed increase. ->=20 -> Can anyone confirm this? If so, one easy step we could take in this -> direction would be adapting COPY FROM to use mmap(). - -Weeee! Alright, so I got to have some fun writing out some simple -tests with mmap() and friends tonight. Are the results interesting? -Absolutely! Is this a simple benchmark? Yup. Do I think it -simulates PostgreSQL? Eh, not particularly. Does it demonstrate that -mmap() is a win and something worth implementing? I sure hope so. Is -this a test program to demonstrate the ideal use of mmap() in -PostgreSQL? No. Is it a place to start a factual discussion? I hope -so. 
- -I have here four tests that are conditionalized by cpp. - -# The first one uses read() and write() but with the buffer size set -# to the same size as the file. -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o test-mmap test-mmap.c -/usr/bin/time ./test-mmap > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047013002.412516 -Time: 82.88178 - -Completed tests - 82.09 real 2.13 user 68.98 sys - -# The second one uses read() and write() with the default buffer size: -# 65536 -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o test-mmap test-mmap.c -/usr/bin/time ./test-mmap > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is default read size: 65536 -Number of iterations: 100000 -Start time: 1047013085.16204 -Time: 18.155511 - -Completed tests - 18.16 real 0.90 user 14.79 sys -# Please note this is significantly faster, but that's expected - -# The third test uses mmap() + madvise() + write() -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o test-mmap test-mmap.c -/usr/bin/time ./test-mmap > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047013103.859818 -Time: 8.4294203644 - -Completed tests - 7.24 real 0.41 user 5.92 sys -# Faster still, and twice as fast as the normal read() case - -# The last test only calls mmap()'s once when the file is opened and -# only msync()'s, munmap()'s, close()'s the file once at exit. 
-gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o test-mmap test-mmap.c -/usr/bin/time ./test-mmap > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047013111.623712 -Time: 1.174076 - -Completed tests - 1.18 real 0.09 user 0.92 sys -# Substantially faster - - -Obviously this isn't perfect, but reading and writing data is faster -with mmap() (specifically moving pages through the VM/OS). Doing partial writes -from mmap()'ed data should be faster along with scanning through -mmap()'ed portions of - or completely mmap()'ed - files because the -pages are already loaded in the VM. PostgreSQL's LRU file descriptor -cache could easily be adjusted to add mmap()'ing of frequently -accessed files (specifically, system catalogs come to mind). It's not -hard to figure out how often particular files are accessed and to -either _avoid_ mmap()'ing a file that isn't accessed often, or to -mmap() files that _are_ accessed often. mmap() does have a cost, but -I'd wager that mmap()'ing the same file a second or third time from a -different process would be more efficient. However, the speedup of searching -through an mmap()'ed file may make it worthwhile to mmap() all -files as long as the system is under a tunable resource limit -(max_mmaped_bytes?). - -If someone is so inclined or there's enough interest, I can reverse -this test case so that data is written to an mmap()'ed file, but the -same performance difference should hold true (assuming this isn't a -write to a tape drive ::grin::). - -The URL for the program used to generate the above tests is at: - -http://people.freebsd.org/~seanc/mmap_test/ - - -Please ask if you have questions. 
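The read()-versus-mmap() comparison described above can be sketched roughly as follows. This is not the test-mmap.c program from the URL; it is a minimal Python approximation in which every name (make_sample, copy_with_read, copy_with_mmap, same_output, bench) is hypothetical. It assumes a 65536-byte read buffer and a 72K random input file, and it applies madvise() only where the interpreter and OS expose it. No timings are claimed, since interpreter overhead differs from a C program.

```python
import io
import mmap
import os
import tempfile
import time


def make_sample(size=72 * 1024):
    """Create a throwaway input file filled with random bytes."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path


def copy_with_read(path, out, bufsize=65536):
    """Copy a file to `out` using plain read() calls of `bufsize` bytes."""
    total = 0
    with open(path, "rb", buffering=0) as src:
        while True:
            chunk = src.read(bufsize)
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total


def copy_with_mmap(path, out):
    """Map the whole file read-only and pass the mapping to write()."""
    size = os.path.getsize(path)
    if size == 0:
        return 0  # an empty file cannot be mapped
    with open(path, "rb") as src:
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # madvise() is only a hint; it needs Python 3.8+ and OS support.
            if hasattr(view, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                view.madvise(mmap.MADV_SEQUENTIAL)
            out.write(view)
    return size


def same_output(path):
    """Check that both copy routines produce identical bytes."""
    a, b = io.BytesIO(), io.BytesIO()
    copy_with_read(path, a)
    copy_with_mmap(path, b)
    return a.getvalue() == b.getvalue()


def bench(copy, path, iterations=1000):
    """Return wall-clock seconds for `iterations` copies to the null device."""
    with open(os.devnull, "wb") as sink:
        start = time.perf_counter()
        for _ in range(iterations):
            copy(path, sink)
        return time.perf_counter() - start
```

Calling bench(copy_with_read, path) and bench(copy_with_mmap, path) on a file from make_sample() gives two numbers to compare. Note that copy_with_mmap maps and unmaps the file on every call, which corresponds to the third test rather than the map-once case.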
-sc - ---=20 -Sean Chittenden - ---KsGdsel6WgEHnImy -Content-Type: application/pgp-signature -Content-Disposition: inline - ------BEGIN PGP SIGNATURE----- -Comment: Sean Chittenden - -iD8DBQE+aDZc3ZnjH7yEs0ERAid6AJ9/TAYMUx2+ZcD2680OlKJBj5FzrACgquIG -PBNCzM0OegBXrPROJ/uIKDM= -=y7O6 ------END PGP SIGNATURE----- - ---KsGdsel6WgEHnImy-- - -From pgsql-performance-owner+M1358=pgman=candle.pha.pa.us@postgresql.org Fri Mar 7 16:47:38 2003 -Return-path: -Received: from relay2.pgsql.com (relay2.pgsql.com [64.49.215.143]) - by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id h27LlX429809 - for ; Fri, 7 Mar 2003 16:47:35 -0500 (EST) -Received: from postgresql.org (postgresql.org [64.49.215.8]) - by relay2.pgsql.com (Postfix) with ESMTP id D40CBEDFE05 - for ; Fri, 7 Mar 2003 16:47:32 -0500 (EST) -X-Original-To: pgsql-performance@postgresql.org -Received: from perrin.int.nxad.com (internal.ext.nxad.com [69.1.70.251]) - by postgresql.org (Postfix) with ESMTP id 913B5474E44 - for ; Fri, 7 Mar 2003 16:46:50 -0500 (EST) -Received: by perrin.int.nxad.com (Postfix, from userid 1001) - id A55392105B; Fri, 7 Mar 2003 13:46:30 -0800 (PST) -Date: Fri, 7 Mar 2003 13:46:30 -0800 -From: Sean Chittenden -To: Tom Lane -cc: Neil Conway , - Christopher Kings-Lynne , - PostgreSQL Performance -Subject: Re: [PERFORM] [COMMITTERS] pgsql-server/ /configure /configure.in rc/incl ... 
-Message-ID: <20030307214630.GI79234@perrin.int.nxad.com> -References: <032f01c2e390$b1842b20$6500a8c0@fhp.internal> <11077.1046921667@sss.pgh.pa.us> <033f01c2e392$71476570$6500a8c0@fhp.internal> <12228.1046922471@sss.pgh.pa.us> <20030306094117.GA79234@perrin.int.nxad.com> <15071.1046964336@sss.pgh.pa.us> <20030307003640.GF79234@perrin.int.nxad.com> <1046998072.10527.67.camel@tokyo> <20030307060412.GA19138@perrin.int.nxad.com> <29933.1047047386@sss.pgh.pa.us> -MIME-Version: 1.0 -Content-Type: multipart/signed; micalg=pgp-sha1; - protocol="application/pgp-signature"; boundary="TALVG7vV++YnpwZG" -Content-Disposition: inline -In-Reply-To: <29933.1047047386@sss.pgh.pa.us> -User-Agent: Mutt/1.4i -X-PGP-Key: finger seanc@FreeBSD.org -X-PGP-Fingerprint: 3849 3760 1AFE 7B17 11A0 83A6 DD99 E31F BC84 B341 -X-Web-Homepage: http://sean.chittenden.org/ -Precedence: bulk -Sender: pgsql-performance-owner@postgresql.org -Status: OR - ---TALVG7vV++YnpwZG -Content-Type: text/plain; charset=us-ascii -Content-Disposition: inline -Content-Transfer-Encoding: quoted-printable - -> > Absolutely! Is this a simple benchmark? Yup. Do I think it -> > simulates PostgreSQL? Eh, not particularly. - -I think quite a few of these Q's would have been answered by reading -the code/Makefile.... - -> This would be on what OS? - -FreeBSD, but it shouldn't matter. Any reasonably written VM should -have similar numbers (though BSD is generally regarded as having the -best VM, which, I think Linux poached not that long ago, iirc -::grimace::). - -> What hardware? - -My ultra-pathetic laptop with some fine - overly-noisy and can hardly -buildworld - IDE drives. - -> What size test file? - -In this case, only 72K. I've just updated the test program to use an -array of files though. - -> Do the "iterations" mean so many reads of the entire file, or so -> many buffer-sized read requests? - -In some cases, yes. With the file mmap()'ed, sorta. 
One of the test -cases (the one that did it in ~8s), mmap()'ed and munmap()'ed the file -every iteration and was twice as fast as the vanilla read() call. - -> Did the mmap case actually *read* anything, or just map and unmap -> the file? - -Nope, read it and wrote it out to stdout (which was redirected to -/dev/null). - -> Also, what did you do to normalize for the effects of the test file -> being already in kernel disk cache after the first test? - -That honestly doesn't matter too much since I wasn't testing the rate -of reading in files from my hard drive, only the OS's ability to -read/write pages of data around. In any case, I've updated my test -case to iterate through an array of files instead of just reading in a -copy of /etc/services. My laptop is generally a poor benchmark for -disk read performance given it takes 8hrs to buildworld, over 12hrs to -build mozilla, 18 for KDE, and about 48hrs for Open Office. :) -Someone with faster disks may want to try this and report back, but it -doesn't matter much in terms of relevancy for considering the benefits -of mmap(). The point is that there are calls that can be used that -substantially speed up read()'s and write()'s by allowing the VM to -align pages of data and give hints about its usage. For the sake of -argument re: the previously done tests, I'll reverse the order in -which I ran them and I bet dime to dollar that the times will be -identical. 
- -% make ~/open_source/mmap_test -cp -f /etc/services ./services -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -DDO_MMAP_ONCE=1 -o mmap-test mmap-test.c -/usr/bin/time ./mmap-test > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047064672.276544 -Time: 1.281477 - -Completed tests - 1.29 real 0.10 user 0.92 sys -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -DDO_MMAP=1 -o mmap-test mmap-test.c -/usr/bin/time ./mmap-test > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047064674.266191 -Time: 7.486622 - -Completed tests - 7.49 real 0.41 user 6.01 sys -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -DDEFAULT_READSIZE=1 -o mmap-test mmap-test.c -/usr/bin/time ./mmap-test > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is default read size: 65536 -Number of iterations: 100000 -Start time: 1047064682.288637 -Time: 19.35214 - -Completed tests - 19.04 real 0.88 user 15.43 sys -gcc -O3 -finline-functions -fkeep-inline-functions -funroll-loops -o mmap-test mmap-test.c -/usr/bin/time ./mmap-test > /dev/null -Beginning tests with file: services - -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047064701.867031 -Time: 82.4294540875 - -Completed tests - 81.57 real 2.10 user 69.55 sys - - -Here's the updated test that iterates through. Ooh! One better, the -files I've used are actual data files from ~pgsql. The new benchmark -iterates through the list of files and calls bench() once for each -file and restarts at the first file after reaching the end of its -list (ARGV). 
- -Whoa, if these tests are even close to real world, then we at the very -least should be mmap()'ing the file every time we read it (assuming -we're reading more than just a handful of bytes): - -find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null -Page size: 4096 -File read size is the same as the file size -Number of iterations: 100000 -Start time: 1047071143.463360 -Time: 12.109530 - -Completed tests - 12.11 real 0.36 user 6.80 sys - -find /usr/local/pgsql/data -type f | /usr/bin/xargs /usr/bin/time ./mmap-test > /dev/null -Page size: 4096 -File read size is default read size: 65536 -Number of iterations: 100000 -.... [been waiting here for >40min now....] - - -Ah well, if these tests finish this century, I'll post the results in -a bit, but it's pretty clearly a win. In terms of the data that I'm -copying, I'm copying ~700MB of data from my test DB on my laptop. I -only have 256MB of RAM so I can pretty much promise you that the data -isn't in my system buffers. If anyone else would like to run the -tests or look at the results, please check it out: - -o1 and o2 should be the only targets used if FILES is bigger than the -RAM on the system. o3's by far and away the fastest, but only in rare -cases will a DBA have more RAM than data. But, as mentioned earlier, -the LRU cache could easily be modified to munmap() infrequently -accessed files to keep the size of mmap()'ed data down to a reasonable -level. - -The updated test programs are at: - -http://people.FreeBSD.org/~seanc/mmap_test/ - --sc - --- -Sean Chittenden - ---TALVG7vV++YnpwZG -Content-Type: application/pgp-signature -Content-Disposition: inline - ------BEGIN PGP SIGNATURE----- -Comment: Sean Chittenden - -iD8DBQE+aRM23ZnjH7yEs0ERAoqhAKCFgmhpvNMqe9tucoFvK1H6J50z2QCeIZEI -mgBHwu/H1pe1sXIX9UG2V+I= -=cFRQ ------END PGP SIGNATURE----- - ---TALVG7vV++YnpwZG-- -