--- /dev/null
+<html>
+<head>
+<title>Apache Performance Notes</title>
+</head>
+<body bgcolor="#ffffff" text="#000000" link="#0000ff" vlink="#000080" alink="#ff0000">
+
+<h1>Apache Performance Notes</h1>
+
+<p>Author: Dean Gaudet
+
+<h3>Introduction</h3>
+<p>Apache is a general webserver, which is designed to be correct first, and
+fast second. Even so, its performance is quite satisfactory. Most
+sites have less than 10Mbits of outgoing bandwidth, which Apache can
+fill using only a low end Pentium-based webserver. In practice sites
+with more bandwidth require more than one machine to fill the bandwidth
+due to other constraints (such as CGI or database transaction overhead).
+For these reasons the development focus has been mostly on correctness
+and configurability.
+
+<p>Unfortunately many folks overlook these facts and cite raw performance
+numbers as if they are some indication of the quality of a web server
+product. There is a bare minimum performance that is acceptable; beyond
+that, extra speed only caters to a much smaller segment of the market.
+But in order to avoid this hurdle to the acceptance of Apache in some
+markets, effort was put into Apache 1.3 to bring performance up to a
+point where the difference with other high-end webservers is minimal.
+
+<p>Finally there are the folks who just plain want to see how fast something
+can go. The author falls into this category. The rest of this document
+is dedicated to these folks who want to squeeze every last bit of
+performance out of Apache's current model, and want to understand why
+it does some things which slow it down.
+
+<p>Note that this is tailored towards Apache 1.3 on Unix. Some of it applies
+to Apache on NT. Apache on NT has not been tuned for performance yet;
+in fact it probably performs very poorly because NT performance requires
+a different programming model.
+
+<h3>Hardware and Operating System Issues</h3>
+
+<p>The single biggest hardware issue affecting webserver performance
+is RAM. A webserver should never ever have to swap, because swapping increases
+the latency of each request beyond a point that users consider "fast
+enough". This causes users to hit stop and reload, further increasing
+the load. You can, and should, control the <code>MaxClients</code>
+setting so that your server does not spawn so many children it starts
+swapping.
+
+<p>Beyond that the rest is mundane: get a fast enough CPU, a fast enough
+network card, and fast enough disks, where "fast enough" is something
+that needs to be determined by experimentation.
+
+<p>Operating system choice is largely a matter of local concerns. But
+a general guideline is to always apply the latest vendor TCP/IP patches.
+HTTP serving completely breaks many of the assumptions built into Unix
+kernels up through 1994 and even 1995. Good choices include
+recent FreeBSD, and Linux.
+
+<h3>Run-Time Configuration Issues</h3>
+
+<h4>HostnameLookups</h4>
+<p>Prior to Apache 1.3, <code>HostnameLookups</code> defaulted to On.
+This adds latency
+to every request because it requires a DNS lookup to complete before
+the request is finished. In Apache 1.3 this setting defaults to Off.
+However (1.3 or later), if you use any <code>allow from domain</code> or
+<code>deny from domain</code> directives then you will pay for a
+double reverse DNS lookup (a reverse, followed by a forward to make sure
+that the reverse is not being spoofed). So for the highest performance
+avoid using these directives (it's fine to use IP addresses rather than
+domain names).
+
+<p>Note that it's possible to scope the directives, such as within
+a <code><Location /server-status></code> section. In this
+case the DNS lookups are only performed on requests matching the
+criteria. Here's an example which disables
+lookups except for .html and .cgi files:
+
+<blockquote><pre>
+HostnameLookups off
+<Files ~ "\.(html|cgi)$">
+ HostnameLookups on
+</Files>
+</pre></blockquote>
+
+But even still, if you just need DNS names
+in some CGIs you could consider doing the
+<code>gethostbyaddr</code> call in the specific CGIs that need it.
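+
+<p>A minimal sketch of such a per-CGI lookup, assuming the CGI is written
+in C and takes the client address from the <code>REMOTE_ADDR</code>
+environment variable. The helper name <code>remote_host_from_addr</code>
+is hypothetical; the sketch handles IPv4 only and uses
+<code>getnameinfo</code>, the protocol-independent counterpart of
+<code>gethostbyaddr</code>:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Hypothetical helper: resolve a dotted-quad IPv4 address (as found in the
 * CGI variable REMOTE_ADDR) to a host name.  Returns 0 on success and -1 if
 * the address cannot be parsed or the lookup fails.  When no name is found,
 * getnameinfo() typically fills `host` with the numeric form instead. */
int remote_host_from_addr(const char *addr, char *host, size_t hostlen)
{
    struct sockaddr_in sa;

    assert(addr != NULL && host != NULL && hostlen > 0);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    if (inet_pton(AF_INET, addr, &sa.sin_addr) != 1)
        return -1;                      /* not a valid IPv4 address */

    if (getnameinfo((struct sockaddr *)&sa, sizeof(sa),
                    host, (socklen_t)hostlen, NULL, 0, 0) != 0)
        return -1;                      /* resolver error */
    return 0;
}
```

+<p>A CGI would call it as
+<code>remote_host_from_addr(getenv("REMOTE_ADDR"), host, sizeof(host))</code>
+after checking that the variable is set. This performs only the reverse
+lookup, not the double reverse check described above.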
+
+<h4>FollowSymLinks and SymLinksIfOwnerMatch</h4>
+<p>Wherever in your URL-space you do not have an
+<code>Options FollowSymLinks</code>, or you do have an
+<code>Options SymLinksIfOwnerMatch</code> Apache will have to
+issue extra system calls to check up on symlinks: one extra call per
+filename component. For example, if you had:
+
+<blockquote><pre>
+DocumentRoot /www/htdocs
+<Directory />
+ Options SymLinksIfOwnerMatch
+</Directory>
+</pre></blockquote>
+
+and a request is made for the URI <code>/index.html</code>,
+then Apache will perform <code>lstat(2)</code> on <code>/www</code>,
+<code>/www/htdocs</code>, and <code>/www/htdocs/index.html</code>. The
+results of these <code>lstats</code> are never cached,
+so they will occur on every single request. If you really desire the
+symlinks security checking you can do something like this:
+
+<blockquote><pre>
+DocumentRoot /www/htdocs
+<Directory />
+ Options FollowSymLinks
+</Directory>
+<Directory /www/htdocs>
+ Options -FollowSymLinks +SymLinksIfOwnerMatch
+</Directory>
+</pre></blockquote>
+
+This at least avoids the extra checks for the <code>DocumentRoot</code>
+path. Note that you'll need to add similar sections if you have any
+<code>Alias</code> or <code>RewriteRule</code> paths outside of your
+document root. For highest performance, and no symlink protection,
+set <code>FollowSymLinks</code> everywhere, and never set
+<code>SymLinksIfOwnerMatch</code>.
+
+<h4>AllowOverride</h4>
+
+<p>Wherever in your URL-space you allow overrides (typically
+<code>.htaccess</code> files) Apache will attempt to open
+<code>.htaccess</code> for each filename component. For example, if you had:
+
+<blockquote><pre>
+DocumentRoot /www/htdocs
+<Directory />
+ AllowOverride all
+</Directory>
+</pre></blockquote>
+
+and a request is made for the URI <code>/index.html</code>, then
+Apache will attempt to open <code>/.htaccess</code>,
+<code>/www/.htaccess</code>, and <code>/www/htdocs/.htaccess</code>.
+The solutions are similar to the previous case of <code>Options
+FollowSymLinks</code>. For highest performance use
+<code>AllowOverride None</code> everywhere in your filesystem.
+
+<h4>Negotiation</h4>
+
+<p>If at all possible, avoid content-negotiation if you're really
+interested in every last ounce of performance. In practice the
+benefits of negotiation outweigh the performance penalties. There's
+one case where you can speed up the server. Instead of using
+a wildcard such as:
+
+<blockquote><pre>
+DirectoryIndex index
+</pre></blockquote>
+
+Use a complete list of options:
+
+<blockquote><pre>
+DirectoryIndex index.cgi index.pl index.shtml index.html
+</pre></blockquote>
+
+where you list the most common choice first.
+
+<h4>Process Creation</h4>
+
+<p>Prior to Apache 1.3 the <code>MinSpareServers</code>,
+<code>MaxSpareServers</code>, and <code>StartServers</code> settings
+all had drastic effects on benchmark results. In particular, Apache
+required a "ramp-up" period in order to reach a number of children
+sufficient to serve the load being applied. After the initial
+spawning of <code>StartServers</code> children, only one child per
+second would be created to satisfy the <code>MinSpareServers</code>
+setting. So a server being accessed by 100 simultaneous clients,
+using the default <code>StartServers</code> of 5 would take on
+the order of 95 seconds to spawn enough children to handle the load. This
+works fine in practice on real-life servers, because they aren't restarted
+frequently. But it does really poorly on benchmarks which might only run
+for ten minutes.
+
+<p>The one-per-second rule was implemented in an effort to avoid
+swamping the machine with the startup of new children. If the machine
+is busy spawning children it can't service requests. But it has such
+a drastic effect on the perceived performance of Apache that it had
+to be replaced. As of Apache 1.3,
+the code will relax the one-per-second rule. It
+will spawn one, wait a second, then spawn two, wait a second, then spawn
+four, and it will continue exponentially until it is spawning 32 children
+per second. It will stop whenever it satisfies the
+<code>MinSpareServers</code> setting.
+
+<p>This appears to be responsive enough that it's
+almost unnecessary to twiddle the <code>MinSpareServers</code>,
+<code>MaxSpareServers</code> and <code>StartServers</code> knobs. When
+more than 4 children are spawned per second, a message will be emitted
+to the <code>ErrorLog</code>. If you see a lot of these errors then
+consider tuning these settings. Use the <code>mod_status</code> output
+as a guide.
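+
+<p>The ramp-up arithmetic above can be checked with a few lines of C.
+This is only a model of the spawning schedule as described here, not
+Apache's code; the function name <code>spawn_cycles</code> and its
+parameters are made up for illustration:

```c
#include <assert.h>

/* Number of spawn cycles (roughly one per second) needed to create
 * `needed` children.  The rate starts at 1 and doubles each cycle until it
 * reaches per_cycle_cap.  A cap of 1 models the pre-1.3 one-per-second
 * rule; a cap of 32 models the 1.3 schedule (1, 2, 4, ... up to 32). */
int spawn_cycles(int needed, int per_cycle_cap)
{
    int cycles = 0;
    int rate = 1;
    int spawned = 0;

    assert(needed >= 0 && per_cycle_cap >= 1);

    while (spawned < needed) {
        spawned += rate;
        ++cycles;
        if (rate * 2 <= per_cycle_cap)
            rate *= 2;              /* still ramping up */
        else
            rate = per_cycle_cap;   /* stay at the cap */
    }
    return cycles;
}
```

+<p>With 95 children still to create, <code>spawn_cycles(95, 1)</code>
+returns 95, while <code>spawn_cycles(95, 32)</code> returns 7
+(1+2+4+8+16+32+32 = 95).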
+
+<p>Related to process creation is process death induced by the
+<code>MaxRequestsPerChild</code> setting. By default this is 30, which
+is probably far too low unless your server is using a module such as
+<code>mod_perl</code> which causes children to have bloated memory
+images. If your server is serving mostly static pages then consider
+raising this value to something like 10000. The code is robust enough
+that this shouldn't be a problem.
+
+<p>When keep-alives are in use, children will be kept busy
+doing nothing waiting for more requests on the already open
+connection. The default <code>KeepAliveTimeout</code> of
+15 seconds attempts to minimize this effect. The tradeoff
+here is between network bandwidth and server resources.
+In no event should you raise this above about 60 seconds, as
+<a href="http://www.research.digital.com/wrl/techreports/abstracts/95.4.html">
+most of the benefits are lost</a>.
+
+<h3>Compile-Time Configuration Issues</h3>
+
+<h4>mod_status and Rule STATUS=yes</h4>
+
+<p>If you include <code>mod_status</code>
+and you also set <code>Rule STATUS=yes</code> when building
+Apache, then on every request Apache will perform two calls to
+<code>gettimeofday(2)</code> (or <code>times(2)</code> depending
+on your operating system), and (pre-1.3) several extra calls to
+<code>time(2)</code>. This is all done so that the status report
+contains timing indications. For highest performance, set <code>Rule
+STATUS=no</code>.
+
+<h4>accept Serialization - multiple sockets</h4>
+
+<p>This discusses a shortcoming in the Unix socket API.
+Suppose your
+web server uses multiple <code>Listen</code> statements to listen on
+either multiple ports or multiple addresses. In order to test each
+socket to see if a connection is ready Apache uses <code>select(2)</code>.
+<code>select(2)</code> indicates that a socket has <i>none</i> or
+<i>at least one</i> connection waiting on it. Apache's model includes
+multiple children, and all the idle ones test for new connections at the
+same time. A naive implementation looks something like this
+(these examples do not match the code, they're contrived for
+pedagogical purposes):
+
+<blockquote><pre>
+    for (;;) {
+        for (;;) {
+            fd_set accept_fds;
+
+            FD_ZERO (&accept_fds);
+            for (i = first_socket; i <= last_socket; ++i) {
+                FD_SET (i, &accept_fds);
+            }
+            rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
+            if (rc < 1) continue;
+            new_connection = -1;
+            for (i = first_socket; i <= last_socket; ++i) {
+                if (FD_ISSET (i, &accept_fds)) {
+                    new_connection = accept (i, NULL, NULL);
+                    if (new_connection != -1) break;
+                }
+            }
+            if (new_connection != -1) break;
+        }
+        process the new_connection;
+    }
+</pre></blockquote>
+
+But this naive implementation has a serious starvation problem. Recall
+that multiple children execute this loop at the same time, and so multiple
+children will block at <code>select</code> when they are in between
+requests. All those blocked children will awaken and return from
+<code>select</code> when a single request appears on any socket
+(the number of children which awaken varies depending on the operating
+system and timing issues).
+They will all then fall down into the loop and try to <code>accept</code>
+the connection. But only one will succeed (assuming there's still only
+one connection ready), the rest will be <i>blocked</i> in <code>accept</code>.
+This effectively locks those children into serving requests from that
+one socket and no other sockets, and they'll be stuck there until enough
+new requests appear on that socket to wake them all up.
+This starvation problem was first documented in
+<a href="http://bugs.apache.org/index/full/467">PR#467</a>. There
+are at least two solutions.
+
+<p>One solution is to make the sockets non-blocking. In this case the
+<code>accept</code> won't block the children, and they will be allowed
+to continue immediately. But this wastes CPU time. Suppose you have
+ten idle children in <code>select</code>, and one connection arrives.
+Then all ten may wake up, but only one gets the connection; the other nine
+try to <code>accept</code> it, fail, and loop back into <code>select</code>, accomplishing
+nothing. Meanwhile none of those children are servicing requests that
+occurred on other sockets until they get back up to the <code>select</code>
+again. Overall this solution does not seem very fruitful unless you
+have as many idle CPUs (in a multiprocessor box) as you have idle children,
+not a very likely situation.
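+
+<p>A sketch of the non-blocking variant, for illustration only (it is not
+what Apache does). The helper names <code>set_nonblocking</code>,
+<code>listen_nonblocking_loopback</code> and <code>try_accept</code> are
+hypothetical. The point is that a child which loses the race gets
+<code>EAGAIN</code> back from <code>accept</code> instead of blocking:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Put a descriptor into non-blocking mode; returns -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Open a non-blocking listening socket on an ephemeral loopback port.
 * Returns the descriptor, or -1 on failure. */
int listen_nonblocking_loopback(void)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd == -1)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                        /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) == -1
        || listen(fd, 16) == -1
        || set_nonblocking(fd) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Try to accept without blocking.  Returns the new connection, or -1.
 * *would_block is set to 1 when there was simply nothing to accept
 * (for example because another child got the connection first). */
int try_accept(int listen_fd, int *would_block)
{
    int conn;

    assert(would_block != NULL);
    conn = accept(listen_fd, NULL, NULL);
    *would_block = (conn == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return conn;
}
```

+<p>A child that sees <code>would_block</code> set simply returns to
+<code>select</code>, which is exactly the wasted round trip described above.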
+
+<p>Another solution, the one used by Apache, is to serialize entry into
+the inner loop. The loop looks like this (differences highlighted):
+
+<blockquote><pre>
+    for (;;) {
+        <b>accept_mutex_on ();</b>
+        for (;;) {
+            fd_set accept_fds;
+
+            FD_ZERO (&accept_fds);
+            for (i = first_socket; i <= last_socket; ++i) {
+                FD_SET (i, &accept_fds);
+            }
+            rc = select (last_socket+1, &accept_fds, NULL, NULL, NULL);
+            if (rc < 1) continue;
+            new_connection = -1;
+            for (i = first_socket; i <= last_socket; ++i) {
+                if (FD_ISSET (i, &accept_fds)) {
+                    new_connection = accept (i, NULL, NULL);
+                    if (new_connection != -1) break;
+                }
+            }
+            if (new_connection != -1) break;
+        }
+        <b>accept_mutex_off ();</b>
+        process the new_connection;
+    }
+</pre></blockquote>
+
+<a name="serialize"></a>
+The functions <code>accept_mutex_on</code> and <code>accept_mutex_off</code>
+implement a mutual exclusion semaphore. Only one child can have the
+mutex at any time. There are several choices for implementing these
+mutexes. The choice is defined in <code>src/conf.h</code> (pre-1.3) or
+<code>src/main/conf.h</code> (1.3 or later). Some architectures
+do not have any locking choice made; on these architectures it is unsafe
+to use multiple <code>Listen</code> directives.
+
+<dl>
+<dt><code>USE_FLOCK_SERIALIZED_ACCEPT</code>
+<dd>This method uses the <code>flock(2)</code> system call to lock a
+lock file (located by the <code>LockFile</code> directive).
+
+<dt><code>USE_FCNTL_SERIALIZED_ACCEPT</code>
+<dd>This method uses the <code>fcntl(2)</code> system call to lock a
+lock file (located by the <code>LockFile</code> directive).
+
+<dt><code>USE_SYSVSEM_SERIALIZED_ACCEPT</code>
+<dd>(1.3 or later) This method uses SysV-style semaphores to implement the
+mutex. Unfortunately SysV-style semaphores have some bad side-effects.
+One is that it's possible Apache will die without cleaning up the semaphore
+(see the <code>ipcs(8)</code> man page). The other is that the semaphore
+API allows for a denial of service attack by any CGIs running under the
+same uid as the webserver (i.e. all CGIs unless you use something
+like suexec or cgiwrapper). For these reasons this method is not used
+on any architecture except IRIX (where the previous two are prohibitively
+expensive on most IRIX boxes).
+
+<dt><code>USE_USLOCK_SERIALIZED_ACCEPT</code>
+<dd>(1.3 or later) This method is only available on IRIX, and uses
+<code>usconfig(2)</code> to create a mutex. While this method avoids
+the hassles of SysV-style semaphores, it is not the default for IRIX.
+This is because on single processor IRIX boxes (5.3 or 6.2) the
+uslock code is two orders of magnitude slower than the SysV-semaphore
+code. On multi-processor IRIX boxes the uslock code is an order of magnitude
+faster than the SysV-semaphore code. Kind of a messed up situation.
+So if you're using a multiprocessor IRIX box then you should rebuild your
+webserver with <code>-DUSE_USLOCK_SERIALIZED_ACCEPT</code> on the
+<code>EXTRA_CFLAGS</code>.
+
+<dt><code>USE_PTHREADS_SERIALIZED_ACCEPT</code>
+<dd>(1.3 or later) This method uses POSIX mutexes and should work on
+any architecture implementing the full POSIX threads specification;
+however, it appears to work only on Solaris (2.5 or later). This is the
+default for Solaris 2.5 or later.
+</dl>
+
+<p>If your system has another method of serialization which isn't in the
+above list then it may be worthwhile adding code for it (and submitting
+a patch back to Apache).
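+
+<p>To make the idea concrete, here is a rough sketch of the
+<code>fcntl(2)</code> flavour. It is not the code from
+<code>http_main.c</code>: the <code>accept_mutex_init</code> function, the
+<code>int</code> return values and the error handling are assumptions made
+for the example, and the path argument stands in for the
+<code>LockFile</code> setting:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int lock_fd = -1;            /* descriptor of the lock file */

/* Open (creating if necessary) the lock file.  fcntl locks belong to the
 * process, so children forked after this call contend with one another. */
int accept_mutex_init(const char *path)
{
    assert(path != NULL);
    lock_fd = open(path, O_CREAT | O_WRONLY, 0600);
    return lock_fd == -1 ? -1 : 0;
}

/* Apply a whole-file lock operation, retrying if a signal interrupts it. */
static int lock_op(short type)
{
    struct flock fl;
    int rc;

    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;               /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                   /* 0 means "to end of file" */

    do {
        rc = fcntl(lock_fd, F_SETLKW, &fl);   /* blocks until granted */
    } while (rc == -1 && errno == EINTR);
    return rc;
}

int accept_mutex_on(void)  { return lock_op(F_WRLCK); }
int accept_mutex_off(void) { return lock_op(F_UNLCK); }
```

+<p>Each child would call <code>accept_mutex_on()</code> before entering the
+<code>select</code>/<code>accept</code> loop and
+<code>accept_mutex_off()</code> as soon as it holds a connection.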
+
+<p>Another solution that has been considered but never implemented is
+to partially serialize the loop -- that is, let in a certain number
+of processes. This would only be of interest on multiprocessor boxes
+where it's possible multiple children could run simultaneously, and the
+serialization actually doesn't take advantage of the full bandwidth.
+This is a possible area of future investigation, but priority remains
+low because highly parallel web servers are not the norm.
+
+<p>Ideally you should run servers without multiple <code>Listen</code>
+statements if you want the highest performance. But read on.
+
+<h4>accept Serialization - single socket</h4>
+
+<p>The above is fine and dandy for multiple socket servers, but what
+about single socket servers? In theory they shouldn't experience
+any of these same problems because all children can just block in
+<code>accept(2)</code> until a connection arrives, and no starvation
+results. In practice this hides almost the same "spinning" behaviour
+discussed above in the non-blocking solution. The way that most TCP
+stacks are implemented, the kernel actually wakes up all processes blocked
+in <code>accept</code> when a single connection arrives. One of those
+processes gets the connection and returns to user-space, the rest spin in
+the kernel and go back to sleep when they discover there's no connection
+for them. This spinning is hidden from the user-land code, but it's
+there nonetheless. This can result in the same load-spiking wasteful
+behaviour that a non-blocking solution to the multiple sockets case can.
+
+<p>For this reason we have found that many architectures behave more
+"nicely" if we serialize even the single socket case. So this is
+actually the default in almost all cases. Crude experiments under
+Linux (2.0.30 on a dual Pentium pro 166 w/128Mb RAM) have shown that
+the serialization of the single socket case causes less than a 3%
+decrease in requests per second over unserialized single-socket.
+But unserialized single-socket showed an extra 100ms latency on
+each request. This latency is probably a wash on long haul lines,
+and only an issue on LANs. If you want to override the single socket
+serialization you can define <code>SAFE_UNSERIALIZED_ACCEPT</code>
+and then single-socket servers will not serialize at all.
+
+<h4>Lingering Close</h4>
+
+<p>As discussed in
+<a href="ftp://ds.internic.net/internet-drafts/draft-ietf-http-connection-00.txt">draft-ietf-http-connection-00.txt</a> section 8,
+in order for an HTTP server to <b>reliably</b> implement the protocol
+it needs to shut down each direction of the communication independently
+(recall that a TCP connection is bi-directional, each half is independent
+of the other). This fact is often overlooked by other servers, but
+is correctly implemented in Apache as of 1.2.
+
+<p>When this feature was added to Apache it caused a flurry of
+problems on various versions of Unix because of a shortsightedness.
+The TCP specification does not state that the FIN_WAIT_2 state has a
+timeout, but it doesn't prohibit it. On systems without the timeout,
+Apache 1.2 induces many sockets stuck forever in the FIN_WAIT_2 state.
+In many cases this can be avoided by simply upgrading to the latest
+TCP/IP patches supplied by the vendor. In cases where the vendor has
+never released patches (e.g. SunOS4 -- although folks with a source
+license can patch it themselves) we have decided to disable this feature.
+
+<p>There are two ways of accomplishing a lingering close. One is the
+socket option <code>SO_LINGER</code>. But as fate would have it,
+this has never been implemented properly in most TCP/IP stacks. Even
+on those stacks with a proper implementation (i.e. Linux 2.0.31) this
+method proves to be more expensive (cputime) than the next solution.
+
+<p>For the most part, Apache implements this in a function called
+<code>lingering_close</code> (in <code>http_main.c</code>). The
+function looks roughly like this:
+
+<blockquote><pre>
+    void lingering_close (int s)
+    {
+        char junk_buffer[2048];
+
+        /* shutdown the sending side */
+        shutdown (s, 1);
+
+        signal (SIGALRM, lingering_death);
+        alarm (30);
+
+        for (;;) {
+            select (s for reading, 2 second timeout);
+            if (error) break;
+            if (s is ready for reading) {
+                read (s, junk_buffer, sizeof (junk_buffer));
+                /* just toss away whatever is here */
+            }
+        }
+
+        close (s);
+    }
+</pre></blockquote>
+
+This naturally adds some expense at the end of a connection, but it
+is required for a reliable implementation. As HTTP/1.1 becomes more
+prevalent, and all connections are persistent, this expense will be
+amortized over more requests. If you want to play with fire and
+disable this feature you can define <code>NO_LINGCLOSE</code>, but
+this is not recommended at all. In particular, as HTTP/1.1 pipelined
+persistent connections come into use <code>lingering_close</code>
+is an absolute necessity (and
+<a href="http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html">
+pipelined connections are faster</a>, so you
+want to support them).
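+
+<p>For readers who want something compilable, here is an approximation of
+the function sketched above. It departs from it in a few labelled ways: the
+name <code>lingering_close_sketch</code> is made up, a wall-clock deadline
+replaces the <code>SIGALRM</code> handler, and the loop also stops after two
+seconds of silence or when the peer closes its side:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <time.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Half-close the socket, drain whatever the peer still sends, then close.
 * Returns 0 if the peer was seen to close its side, -1 otherwise.  The
 * socket is closed in both cases. */
int lingering_close_sketch(int s, int max_seconds)
{
    char junk_buffer[2048];
    time_t deadline;
    int peer_closed = 0;

    assert(s >= 0 && max_seconds > 0);
    deadline = time(NULL) + max_seconds;

    /* shut down the sending side only; we can still read */
    shutdown(s, SHUT_WR);

    while (time(NULL) < deadline) {
        fd_set read_fds;
        struct timeval tv;
        ssize_t n;

        FD_ZERO(&read_fds);
        FD_SET(s, &read_fds);
        tv.tv_sec = 2;
        tv.tv_usec = 0;

        if (select(s + 1, &read_fds, NULL, NULL, &tv) <= 0)
            break;                  /* error, or two seconds of silence */

        n = read(s, junk_buffer, sizeof(junk_buffer));
        if (n <= 0) {
            peer_closed = (n == 0); /* 0 bytes means the peer sent its FIN */
            break;
        }
        /* otherwise just toss away whatever arrived */
    }

    close(s);
    return peer_closed ? 0 : -1;
}
```

+<p>Calling <code>lingering_close_sketch(s, 30)</code> mirrors the 30 second
+limit used in the function above.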
+
+<h4>Scoreboard File</h4>
+
+<p>Apache's parent and children communicate with each other through
+something called the scoreboard. Ideally this should be implemented
+in shared memory. For those operating systems that we either have
+access to, or have been given detailed ports for, it typically is
+implemented using shared memory. The rest default to using an
+on-disk file. The on-disk file is not only slow, but it is unreliable
+(and less featured). Peruse the <code>src/main/conf.h</code> file
+for your architecture and look for either <code>HAVE_MMAP</code> or
+<code>HAVE_SHMGET</code>. Defining one of those two enables the
+supplied shared memory code. If your system has another type of
+shared memory then edit the file <code>src/main/http_main.c</code> and
+add the hooks necessary to use it in Apache. (Send us back a patch
+too please.)
+
+<p>Historical note: The Linux port of Apache didn't start to use
+shared memory until version 1.2 of Apache. This oversight resulted
+in really poor and unreliable behaviour of earlier versions of Apache
+on Linux.
+
+<h4><code>DYNAMIC_MODULE_LIMIT</code></h4>
+
+<p>If you have no intention of using dynamically loaded modules
+(you probably don't if you're reading this and tuning your
+server for every last ounce of performance) then you should add
+<code>-DDYNAMIC_MODULE_LIMIT=0</code> when building your server.
+This will save RAM that's allocated only for supporting dynamically
+loaded modules.
+
+<h3>Appendix: Detailed Analysis of a Trace</h3>
+
+Here is a system call trace of Apache 1.3 running on Linux. The run-time
+configuration file is essentially the default plus:
+
+<blockquote><pre>
+<Directory />
+ AllowOverride none
+ Options FollowSymLinks
+</Directory>
+</pre></blockquote>
+
+The file being requested is a static 6K file of no particular content.
+Traces of non-static requests or requests with content negotiation
+look wildly different (and quite ugly in some cases). First the
+entire trace, then we'll examine details. (This was generated by
+the <code>strace</code> program, other similar programs include
+<code>truss</code>, <code>ktrace</code>, and <code>par</code>.)
+
+<blockquote><pre>
+accept(15, {sin_family=AF_INET, sin_port=htons(22283), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
+flock(18, LOCK_UN) = 0
+sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
+read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+time(NULL) = 873959960
+gettimeofday({873959960, 404935}, NULL) = 0
+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+close(4) = 0
+time(NULL) = 873959960
+write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
+gettimeofday({873959960, 417742}, NULL) = 0
+times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
+shutdown(3, 1 /* send */) = 0
+oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
+read(3, "", 2048) = 0
+close(3) = 0
+sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+munmap(0x400ee000, 6144) = 0
+flock(18, LOCK_EX) = 0
+</pre></blockquote>
+
+<p>Notice the accept serialization:
+
+<blockquote><pre>
+flock(18, LOCK_UN) = 0
+...
+flock(18, LOCK_EX) = 0
+</pre></blockquote>
+
+These two calls can be removed by defining
+<code>SAFE_UNSERIALIZED_ACCEPT</code> as described earlier.
+
+<p>Notice the <code>SIGUSR1</code> manipulation:
+
+<blockquote><pre>
+sigaction(SIGUSR1, {SIG_IGN}, {0x8059954, [], SA_INTERRUPT}) = 0
+...
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+...
+sigaction(SIGUSR1, {0x8059954, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+</pre></blockquote>
+
+This is caused by the implementation of graceful restarts. When the
+parent receives a <code>SIGUSR1</code> it sends a <code>SIGUSR1</code>
+to all of its children (and it also increments a "generation counter"
+in shared memory). Any children that are idle (between connections)
+will immediately die
+off when they receive the signal. Any children that are in keep-alive
+connections, but are in between requests will die off immediately. But
+any children that have a connection and are still waiting for the first
+request will not die off immediately.
+
+<p>To see why this is necessary, consider how a browser reacts to a closed
+connection. If the connection was a keep-alive connection and the request
+being serviced was not the first request then the browser will quietly
+reissue the request on a new connection. It has to do this because the
+server is always free to close a keep-alive connection in between requests
+(i.e. due to a timeout or because of a maximum number of requests).
+But, if the connection is closed before the first response has been
+received the typical browser will display a "document contains no data"
+dialogue (or a broken image icon). This is done on the assumption that
+the server is broken in some way (or maybe too overloaded to respond
+at all). So Apache tries to avoid ever deliberately closing the connection
+before it has sent a single response. This is the cause of those
+<code>SIGUSR1</code> manipulations.
+
+<p>Note that it is theoretically possible to eliminate all three of
+these calls. But in rough tests the gain proved to be almost unnoticeable.
+
+<p>In order to implement virtual hosts, Apache needs to know the
+local socket address used to accept the connection:
+
+<blockquote><pre>
+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+</pre></blockquote>
+
+It is possible to eliminate this call in many situations (such as when
+there are no virtual hosts, or when <code>Listen</code> directives are
+used which do not have wildcard addresses). But no effort has yet been
+made to do these optimizations.
+
+<p>Apache turns off the Nagle algorithm:
+
+<blockquote><pre>
+setsockopt(3, IPPROTO_TCP1, [1], 4) = 0
+</pre></blockquote>
+
+because of problems described in
+<a href="http://www.isi.edu/~johnh/PAPERS/Heidemann97a.html">a
+paper by John Heidemann</a>.
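+
+<p>In C the call looks roughly like the following. The wrapper name
+<code>disable_nagle</code> is hypothetical, and reading the fused
+<code>IPPROTO_TCP1</code> in the trace as level <code>IPPROTO_TCP</code>
+plus option <code>TCP_NODELAY</code> is an interpretation of the trace
+output, not something the trace states:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Turn off the Nagle algorithm on a TCP socket so that small writes are
 * sent immediately.  Returns 0 on success, -1 on failure. */
int disable_nagle(int sock)
{
    int one = 1;

    assert(sock >= 0);
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

+<p>The <code>[1], 4</code> in the trace matches the option value of 1 and
+its four byte length.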
+
+<p>Notice the two <code>time</code> calls:
+
+<blockquote><pre>
+time(NULL) = 873959960
+...
+time(NULL) = 873959960
+</pre></blockquote>
+
+One of these occurs at the beginning of the request, and the other occurs
+as a result of writing the log. At least one of these is required to
+properly implement the HTTP protocol. The second occurs because the
+Common Log Format dictates that the log record include a timestamp of the
+end of the request. A custom logging module could eliminate one of the
+calls.
+
+<p>As described earlier, <code>Rule STATUS=yes</code> causes two
+<code>gettimeofday</code> calls and a call to <code>times</code>:
+
+<blockquote><pre>
+gettimeofday({873959960, 404935}, NULL) = 0
+...
+gettimeofday({873959960, 417742}, NULL) = 0
+times({tms_utime=5, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 446747
+</pre></blockquote>
+
+These can be removed by either removing <code>mod_status</code> or
+setting <code>Rule STATUS=no</code>.
+
+<p>It might seem odd to call <code>stat</code>:
+
+<blockquote><pre>
+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+</pre></blockquote>
+
+This is part of the algorithm which calculates the
+<code>PATH_INFO</code> for use by CGIs. In fact if the request had been
+for the URI <code>/cgi-bin/printenv/foobar</code> then there would be
+two calls to <code>stat</code>. The first for
+<code>/home/dgaudet/ap/apachen/cgi-bin/printenv/foobar</code>
+which does not exist, and the second for
+<code>/home/dgaudet/ap/apachen/cgi-bin/printenv</code>, which does exist.
+Regardless, at least one <code>stat</code> call is necessary when
+serving static files because the file size and modification times are
+used to generate HTTP headers (such as <code>Content-Length</code>,
+<code>Last-Modified</code>) and implement protocol features (such
+as <code>If-Modified-Since</code>). A somewhat more clever server
+could avoid the <code>stat</code> when serving non-static files;
+however, doing so in Apache is very difficult given the modular structure.
+
+<p>All static files are served using <code>mmap</code>:
+
+<blockquote><pre>
+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400ee000
+...
+munmap(0x400ee000, 6144) = 0
+</pre></blockquote>
+
+On some architectures it's slower to <code>mmap</code> small
+files than it is to simply <code>read</code> them. The define
+<code>MMAP_THRESHOLD</code> can be set to the minimum size required before
+using <code>mmap</code>. By default it's set to 0 (except on SunOS4
+where experimentation has shown 8192 to be a better value). Using a
+tool such as
+<a href="http://reality.sgi.com/lm_engr/lmbench/lmbench.html">lmbench</a>
+you can determine the optimal setting for your
+environment. It may even be the case that <code>mmap</code> isn't used
+on your architecture; if so then defining <code>USE_MMAP_FILES</code>
+might work (if it works then report back to us).
+
+
+<p>Apache does its best to avoid copying bytes around in memory. The
+first write of any request typically is turned into a <code>writev</code>
+which combines both the headers and the first hunk of data:
+
+<blockquote><pre>
+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+</pre></blockquote>
+
+When doing HTTP/1.1 chunked encoding Apache will generate <code>writev</code>s
+of up to four elements. The goal is to push the byte copying
+into the kernel, where it typically has to happen anyhow (to assemble
+network packets). On testing, various Unixes (BSDI 2.x, Solaris 2.5,
+Linux 2.0.31+) properly combine the elements into network packets.
+Pre-2.0.31 Linux will not combine, and will create a packet for
+each element, so upgrading is a good idea. Defining <code>NO_WRITEV</code>
+will disable this combining, but result in very poor chunked encoding
+performance.
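+
+<p>A minimal sketch of the two-element case shown in the trace, with the
+headers and the body handed to the kernel in one call and no copy into a
+combined buffer. The function name <code>send_headers_and_body</code> is
+hypothetical, and handling of short writes is left out:

```c
#define _POSIX_C_SOURCE 200809L   /* ask for the POSIX declarations */
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Write the response headers and the first hunk of body data with a single
 * writev() call.  Returns the number of bytes written, or -1 on error.
 * A real server must also cope with a return value smaller than the total. */
ssize_t send_headers_and_body(int fd, const char *headers,
                              const void *body, size_t body_len)
{
    struct iovec iov[2];

    assert(headers != NULL && (body != NULL || body_len == 0));

    iov[0].iov_base = (void *)headers;      /* element 1: the headers */
    iov[0].iov_len = strlen(headers);
    iov[1].iov_base = (void *)body;         /* element 2: e.g. the mmap()ed file */
    iov[1].iov_len = body_len;

    return writev(fd, iov, 2);
}
```

+<p>In the trace above the two elements are 245 bytes of headers and the
+6144 byte file, for a total of 6389 bytes.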
+
+<p>The log write:
+
+<blockquote><pre>
+write(17, "127.0.0.1 - - [10/Sep/1997:23:39"..., 71) = 71
+</pre></blockquote>
+
+can be deferred by defining <code>BUFFERED_LOGS</code>. In this case
+up to <code>PIPE_BUF</code> bytes (a POSIX defined constant) of log entries
+are buffered before writing. At no time does it split a log entry
+across a <code>PIPE_BUF</code> boundary, because those writes may not
+be atomic (i.e. entries from multiple children could become mixed together).
+The code does its best to flush this buffer when a child dies.
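+
+<p>The buffering rule can be sketched like this. It is a minimal
+illustration under assumed names (<code>log_buf_sketch</code>,
+<code>log_flush_sketch</code>, <code>log_append_sketch</code>), not the
+code that <code>BUFFERED_LOGS</code> enables: an entry that does not fit
+in the remaining space forces a flush first, so each <code>write</code>
+carries only whole entries and never more than <code>PIPE_BUF</code> bytes.

```c
#include <limits.h>   /* PIPE_BUF */
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PIPE_BUF
#define PIPE_BUF 512  /* assumption: fall back to the POSIX minimum */
#endif

struct log_buf_sketch {
    int fd;               /* log file descriptor */
    size_t used;          /* bytes currently buffered */
    char data[PIPE_BUF];
};

/* Write out whatever is buffered. Returns 0 on success, -1 on error. */
int log_flush_sketch(struct log_buf_sketch *lb)
{
    ssize_t n = 0;

    if (lb->used > 0)
        n = write(lb->fd, lb->data, lb->used);
    lb->used = 0;
    return n < 0 ? -1 : 0;
}

/* Buffer one complete log entry, flushing first if it would not fit. */
int log_append_sketch(struct log_buf_sketch *lb, const char *entry, size_t len)
{
    int rc = 0;

    /* Never split an entry: make room before copying it in. */
    if (len > sizeof lb->data - lb->used)
        rc = log_flush_sketch(lb);

    /* An entry larger than the whole buffer is written directly;
     * such a write cannot be guaranteed atomic in any case. */
    if (len > sizeof lb->data)
        return write(lb->fd, entry, len) < 0 ? -1 : rc;

    memcpy(lb->data + lb->used, entry, len);
    lb->used += len;
    return rc;
}
```

+<p>A child using such a buffer would call the flush routine before
+exiting, otherwise the last few entries are lost.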
+
+<p>The lingering close code causes four system calls:
+
+<blockquote><pre>
+shutdown(3, 1 /* send */) = 0
+oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
+read(3, "", 2048) = 0
+close(3) = 0
+</pre></blockquote>
+
+which were described earlier.
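+
+<p>For readers who want to see the sequence as code, here is a hedged
+sketch of the same four steps. The function name
+<code>lingering_close_sketch</code> and its timeout parameter are
+assumptions made for this example; the 2048-byte read size and the
+two-second wait are simply taken from the trace above.

```c
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Half-close the socket, drain what the client still sends, then close.
 * Returns the result of close(). */
int lingering_close_sketch(int fd, int timeout_secs)
{
    char junk[2048];

    /* Stop sending, but keep the receive side open. */
    if (shutdown(fd, SHUT_WR) == 0) {
        for (;;) {
            fd_set rfds, efds;
            struct timeval tv;

            FD_ZERO(&rfds);
            FD_SET(fd, &rfds);
            FD_ZERO(&efds);
            FD_SET(fd, &efds);
            tv.tv_sec = timeout_secs;
            tv.tv_usec = 0;

            /* Give up on timeout (0) or error (-1). */
            if (select(fd + 1, &rfds, NULL, &efds, &tv) <= 0)
                break;

            /* Discard late data; stop at end of file or on error. */
            if (read(fd, junk, sizeof junk) <= 0)
                break;
        }
    }
    return close(fd);
}
```

+<p>In the trace the client had already closed its side, so a single
+<code>select</code> and a single <code>read</code> returning 0 were enough.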
+
+<p>Let's apply some of these optimizations:
+<code>-DSAFE_UNSERIALIZED_ACCEPT -DBUFFERED_LOGS</code> and
+<code>Rule STATUS=no</code>. Here's the final trace:
+
+<blockquote><pre>
+accept(15, {sin_family=AF_INET, sin_port=htons(22286), sin_addr=inet_addr("127.0.0.1")}, [16]) = 3
+sigaction(SIGUSR1, {SIG_IGN}, {0x8058c98, [], SA_INTERRUPT}) = 0
+getsockname(3, {sin_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
+setsockopt(3, IPPROTO_TCP, 1, [1], 4) = 0
+read(3, "GET /6k HTTP/1.0\r\nUser-Agent: "..., 4096) = 60
+sigaction(SIGUSR1, {SIG_IGN}, {SIG_IGN}) = 0
+time(NULL) = 873961916
+stat("/home/dgaudet/ap/apachen/htdocs/6k", {st_mode=S_IFREG|0644, st_size=6144, ...}) = 0
+open("/home/dgaudet/ap/apachen/htdocs/6k", O_RDONLY) = 4
+mmap(0, 6144, PROT_READ, MAP_PRIVATE, 4, 0) = 0x400e3000
+writev(3, [{"HTTP/1.1 200 OK\r\nDate: Thu, 11"..., 245}, {"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 6144}], 2) = 6389
+close(4) = 0
+time(NULL) = 873961916
+shutdown(3, 1 /* send */) = 0
+oldselect(4, [3], NULL, [3], {2, 0}) = 1 (in [3], left {2, 0})
+read(3, "", 2048) = 0
+close(3) = 0
+sigaction(SIGUSR1, {0x8058c98, [], SA_INTERRUPT}, {SIG_IGN}) = 0
+munmap(0x400e3000, 6144) = 0
+</pre></blockquote>
+
+That's 19 system calls, of which 4 remain relatively easy to remove,
+but don't seem worth the effort.
+
+<h3>Appendix: The Pre-Forking Model</h3>
+
+<p>Apache (on Unix) is a <i>pre-forking</i> model server. The
+<i>parent</i> process is responsible only for forking <i>child</i>
+processes; it does not serve any requests or service any network
+sockets. The child processes actually process connections; they serve
+multiple connections (one at a time) before dying.
+The parent spawns new or kills off old
+children in response to changes in the load on the server (it does so
+by monitoring a scoreboard which the children keep up to date).
+
+<p>This model for servers offers a robustness that other models do
+not. In particular, the parent code is very simple, and with a high
+degree of confidence the parent will continue to do its job without
+error. The children are complex, and when you add in third party
+code via modules, you risk segmentation faults and other forms of
+corruption. Even should such a thing happen, it only affects one
+connection and the server continues serving requests. The parent
+quickly replaces the dead child.
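+
+<p>The division of labour can be sketched in a few lines of C. This is
+a deliberately simplified illustration, not Apache's implementation: all
+names are hypothetical, the parent here only replaces children that exit
+(up to a fixed number of replacements) and does not consult a scoreboard
+or react to load, and the canned reply stands in for real request handling.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: serve up to max_requests connections, one at a time, then exit. */
static void child_main_sketch(int listen_fd, int max_requests)
{
    static const char reply[] = "HTTP/1.0 200 OK\r\n\r\nhello\n";
    int i;

    for (i = 0; i < max_requests; i++) {
        int conn = accept(listen_fd, NULL, NULL);
        if (conn < 0)
            break;
        ssize_t n = write(conn, reply, sizeof reply - 1);
        (void)n;  /* a real server would check for errors */
        close(conn);
    }
    _exit(0);
}

/* Fork one child. Returns its pid in the parent, -1 on failure. */
static pid_t spawn_child_sketch(int listen_fd, int max_requests)
{
    pid_t pid = fork();

    if (pid == 0)
        child_main_sketch(listen_fd, max_requests);  /* never returns */
    return pid;
}

/* Parent: start nchildren, then replace each child that exits, for at
 * most max_respawns replacements. The parent never touches a connection.
 * Returns the number of children reaped. */
int prefork_sketch(int listen_fd, int nchildren,
                   int max_requests, int max_respawns)
{
    int alive = 0, reaped = 0, respawned = 0;
    int i;

    for (i = 0; i < nchildren; i++)
        if (spawn_child_sketch(listen_fd, max_requests) > 0)
            alive++;

    while (alive > 0) {
        if (wait(NULL) < 0)
            break;
        alive--;
        reaped++;
        if (respawned < max_respawns &&
            spawn_child_sketch(listen_fd, max_requests) > 0) {
            alive++;
            respawned++;
        }
    }
    return reaped;
}
```

+<p>Because the parent only forks and waits, a crash in a child takes
+down at most the connection that child was handling.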
+
+<p>Pre-forking is also very portable across dialects of Unix.
+Historically this has been an important goal for Apache, and it continues
+to remain so.
+
+<p>The pre-forking model comes under criticism for various
+performance aspects. Of particular concern are the overhead
+of forking a process, the overhead of context switches between
+processes, and the memory overhead of having multiple processes.
+Furthermore it does not offer as many opportunities for data-caching
+between requests (such as a pool of <code>mmapped</code> files).
+Various other models exist and extensive analysis can be found in the
+<a href="http://www.cs.wustl.edu/~jxh/research/research.html"> papers
+of the JAWS project</a>. In practice all of these costs vary drastically
+depending on the operating system.
+
+<p>Apache's core code is already multithread-aware, and Apache version
+1.3 is multithreaded on NT. There have been at least two other experimental
+implementations of threaded Apache (one using the 1.3 code base on DCE,
+and one using a custom user-level threads package and the 1.0 code base;
+neither is publicly available). Part of our redesign for version 2.0
+of Apache will include abstractions of the server model so that we
+can continue to support the pre-forking model, and also support various
+threaded models.
+
+</body>
+</html>