   1 <?xml version="1.0"?>
   2 <!DOCTYPE modulesynopsis SYSTEM "../style/modulesynopsis.dtd">
   3 <?xml-stylesheet type="text/xsl" href="../style/manual.en.xsl"?>
   4 <!-- $LastChangedRevision$ -->
   5
   6 <!--
   7  Licensed to the Apache Software Foundation (ASF) under one or more
   8  contributor license agreements.  See the NOTICE file distributed with
   9  this work for additional information regarding copyright ownership.
  10  The ASF licenses this file to You under the Apache License, Version 2.0
  11  (the "License"); you may not use this file except in compliance with
  12  the License.  You may obtain a copy of the License at
  13
  14      http://www.apache.org/licenses/LICENSE-2.0
  15
  16  Unless required by applicable law or agreed to in writing, software
  17  distributed under the License is distributed on an "AS IS" BASIS,
  18  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  19  See the License for the specific language governing permissions and
  20  limitations under the License.
  21 -->
  22
  23 <modulesynopsis metafile="mod_unique_id.xml.meta">
  24
  25 <name>mod_unique_id</name>
  26 <description>Provides an environment variable with a unique
  27 identifier for each request</description>
  28 <status>Extension</status>
  29 <sourcefile>mod_unique_id.c</sourcefile>
  30 <identifier>unique_id_module</identifier>
  31
  32 <summary>
  33
  34     <p>This module provides a magic token for each request which is
  35     guaranteed to be unique across "all" requests under very
  36     specific conditions. The unique identifier is even unique
  37     across multiple machines in a properly configured cluster of
  38     machines. The environment variable <code>UNIQUE_ID</code> is
  39     set to the identifier for each request. Unique identifiers are
  40     useful for various reasons which are beyond the scope of this
  41     document.</p>
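
         <p>As an illustration, a CGI program written in C might read the
         identifier roughly as sketched below. This is a minimal sketch and
         not part of the module: the names <code>copy_token</code>,
         <code>copy_unique_id</code> and <code>same_request</code> are
         hypothetical, and it assumes the program is started by a server
         that has mod_unique_id loaded, so that <code>UNIQUE_ID</code> is
         present in its environment.</p>

         <example><pre><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copy an identifier into buf. Returns 1 on success, or 0 when the
 * identifier is missing, empty, or does not fit into the buffer. */
int copy_token(const char *id, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    if (id == NULL || id[0] == '\0' || strlen(id) >= size) {
        buf[0] = '\0';
        return 0;
    }
    strcpy(buf, id);
    return 1;
}

/* Read UNIQUE_ID from the environment of the current process. */
int copy_unique_id(char *buf, size_t size)
{
    return copy_token(getenv("UNIQUE_ID"), buf, size);
}

/* Identifiers are opaque tokens, so equality is the only comparison made. */
int same_request(const char *a, const char *b)
{
    return a != NULL && b != NULL && strcmp(a, b) == 0;
}
```
]]></pre></example>

         <p>A caller would typically pass a buffer comfortably larger than
         the identifier, for example <code>char id[64]</code>, and fall back
         to some other behaviour when <code>copy_unique_id</code> returns
         0.</p>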
  42 </summary>
  43
  44 <section id="theory">
  45     <title>Theory</title>
  46
  47     <p>First a brief recap of how the Apache server works on Unix
  48     machines. This feature currently isn't supported on Windows NT.
  49     On Unix machines, Apache creates several children, which
  50     process requests one at a time. Each child can serve multiple
  51     requests in its lifetime. For the purpose of this discussion,
  52     the children don't share any data with each other. We'll refer
  53     to the children as <dfn>httpd processes</dfn>.</p>
  54
  55     <p>Your website has one or more machines under your
  56     administrative control; together we'll call them a cluster of
  57     machines. Each machine may run multiple instances of
  58     Apache. All of these collectively are considered "the
  59     universe", and with certain assumptions we'll show that in this
  60     universe we can generate unique identifiers for each request,
  61     without extensive communication between machines in the
  62     cluster.</p>
  63
  64     <p>The machines in your cluster should satisfy these
  65     requirements. (Even if you have only one machine you should
  66     synchronize its clock with NTP.)</p>
  67
  68     <ul>
  69       <li>The machines' times are synchronized via NTP or other
  70       network time protocol.</li>
  71
  72       <li>The machines' hostnames all differ, such that the module
  73       can resolve each machine's hostname and receive a
  74       different IP address for every machine in the cluster.</li>
  75     </ul>
  76
  77     <p>As far as operating system assumptions go, we assume that
  78     pids (process ids) fit in 32 bits. If the operating system uses
  79     more than 32 bits for a pid, the fix is trivial but must be
  80     performed in the code.</p>
  81
  82     <p>Given those assumptions, at a single point in time we can
  83     identify any httpd process on any machine in the cluster from
  84     all other httpd processes. The machine's IP address and the pid
  85     of the httpd process are sufficient to do this. An httpd process
  86     can handle multiple requests simultaneously if you use a
  87     multi-threaded MPM. In order to identify threads, we use a thread
  88     index that Apache httpd uses internally. So in order to
  89     generate unique identifiers for requests we need only
  90     distinguish between different points in time.</p>
  91
  92     <p>To distinguish time we will use a Unix timestamp (seconds
  93     since January 1, 1970 UTC), and a 16-bit counter. The timestamp
  94     has only one second granularity, so the counter is used to
  95     represent up to 65536 values during a single second. The
  96     quadruple <em>( ip_addr, pid, time_stamp, counter )</em> is
  97     sufficient to enumerate 65536 requests per second per httpd
  98     process. There are issues however with pid reuse over time, and
  99     the counter is used to alleviate this issue.</p>
 100
 101     <p>When an httpd child is created, the counter is initialized
 102     with ( current microseconds divided by 10 ) modulo 65536 (this
 103     formula was chosen to eliminate some variance problems with the
 104     low order bits of the microsecond timers on some systems). When
 105     a unique identifier is generated, the time stamp used is the
 106     time the request arrived at the web server. The counter is
 107     incremented every time an identifier is generated (and allowed
 108     to roll over).</p>
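
         <p>The counter handling described above can be sketched as follows.
         This is only an illustration under stated assumptions, not the
         module's source: the function names are hypothetical, the counter
         is modelled as a <code>uint16_t</code>, and the current
         microseconds are taken from the POSIX <code>gettimeofday()</code>
         call.</p>

         <example><pre><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Seed value as described in the text: (microseconds / 10) modulo 65536. */
uint16_t counter_seed_from_usec(long usec)
{
    return (uint16_t)((usec / 10) % 65536);
}

/* Seed from the microsecond part of the current time of day. */
uint16_t counter_seed_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return counter_seed_from_usec((long)tv.tv_usec);
}

/* Return the current counter value and advance the counter.
 * The cast back to 16 bits makes 65535 roll over to 0. */
uint16_t counter_next(uint16_t *counter)
{
    assert(counter != NULL);
    uint16_t value = *counter;
    *counter = (uint16_t)(*counter + 1);
    return value;
}
```
]]></pre></example>

         <p>In this sketch a child would call <code>counter_seed_now()</code>
         once at startup and <code>counter_next()</code> once for every
         identifier it generates.</p>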
 109
 110     <p>The kernel generates a pid for each process as it forks the
 111     process, and pids are allowed to roll over (they're 16 bits on
 112     many Unixes, but newer systems have expanded to 32 bits). So
 113     over time the same pid will be reused. However, unless it is
 114     reused within the same second, it does not destroy the
 115     uniqueness of our quadruple. That is, we assume the system does
 116     not spawn 65536 processes in a one second interval (it may even
 117     be 32768 processes on some Unixes, but even this isn't likely
 118     to happen).</p>
 119
 120     <p>Suppose that time repeats itself for some reason. That is,
 121     suppose that the system's clock is screwed up and it revisits a
 122     past time (or it is too far forward, is reset correctly, and
 123     then revisits the future time). In this case we can easily show
 124     that we can get pid and time stamp reuse. The choice of
 125     initializer for the counter is intended to help defeat this.
 126     Note that we really want a random number to initialize the
 127     counter, but there aren't any readily available random numbers on most
 128     systems (<em>i.e.</em>, you can't use rand() because you need
 129     to seed the generator, and can't seed it with the time because
 130     time, at least at one second resolution, has repeated itself).
 131     This is not a perfect defense.</p>
 132
 133     <p>How good a defense is it? Suppose that one of your machines
 134     serves at most 500 requests per second (which is a very
 135     reasonable upper bound at this writing, because systems
 136     generally do more than just shovel out static files). To do
 137     that it will require a number of children which depends on how
 138     many concurrent clients you have. But we'll be pessimistic and
 139     suppose that a single child is able to serve 500 requests per
 140     second. There are 1000 possible starting counter values such
 141     that two sequences of 500 requests overlap. So there is a 1.5%
 142     chance (1000 in 65536) that if time (at one second resolution) repeats itself
 143     this child will repeat a counter value, and uniqueness will be
 144     broken. This was a very pessimistic example, and with real
 145     world values it's even less likely to occur. If your system is
 146     such that it's still likely to occur, then perhaps you should
 147     make the counter 32 bits (by editing the code).</p>
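
         <p>The estimate above can be reproduced with a few lines of C. The
         sketch below uses the same approximation as the text, namely that
         about twice the request rate of the 65536 possible starting values
         lead to an overlap; the function name
         <code>counter_overlap_probability</code> is hypothetical.</p>

         <example><pre><![CDATA[
```c
#include <assert.h>

/* Approximate chance that two runs of `requests_per_second` consecutive
 * 16-bit counter values overlap when the second run starts at a uniformly
 * random counter value. Uses the approximation from the text: roughly
 * 2 * requests_per_second of the 65536 starting values collide. */
double counter_overlap_probability(unsigned requests_per_second)
{
    assert(requests_per_second > 0);
    double colliding_starts = 2.0 * (double)requests_per_second;
    if (colliding_starts > 65536.0) {
        colliding_starts = 65536.0;   /* a probability cannot exceed 1 */
    }
    return colliding_starts / 65536.0;
}
```
]]></pre></example>

         <p>For 500 requests per second this gives 1000 / 65536, or about
         0.0153, which is the 1.5% quoted above.</p>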
 148
 149     <p>You may be concerned about the clock being "set back" when
 150     summer daylight saving time ends. However, this isn't an issue because
 151     the times used here are UTC, which "always" goes forward. Note
 152     that x86 based Unixes may need proper configuration for this to
 153     be true -- they should be configured to assume that the
 154     motherboard clock is on UTC and compensate appropriately. But
 155     even so, if you're running NTP then your UTC time will be
 156     correct very shortly after reboot.</p>
 157
 158     <!-- FIXME: thread_index is unsigned int, so not always 32bit.-->
 159     <p>The <code>UNIQUE_ID</code> environment variable is
 160     constructed by encoding the 144-bit (32-bit IP address, 32-bit
 161     pid, 32-bit time stamp, 16-bit counter, 32-bit thread index)
 162     five-element tuple using the
 163     alphabet <code>[A-Za-z0-9@-]</code> in a manner similar to MIME
 164     base64 encoding, producing 24 characters. The MIME base64
 165     alphabet is actually <code>[A-Za-z0-9+/]</code>; however,
 166     <code>+</code> and <code>/</code> need to be specially encoded
 167     in URLs, which makes them less desirable. All values are
 168     encoded in network byte ordering so that the encoding is
 169     comparable across architectures of different byte ordering. The
 170     actual ordering of the encoding is: time stamp, IP address,
 171     pid, counter. This ordering has a purpose, but it should be
 172     emphasized that applications should not dissect the encoding.
 173     Applications should treat the entire encoded
 174     <code>UNIQUE_ID</code> as an opaque token, which can be
 175     compared against other <code>UNIQUE_ID</code>s for equality
 176     only.</p>
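
         <p>To make the arithmetic concrete, the following sketch packs the
         five fields into 18 bytes in network byte order and encodes them
         with the alphabet given above, six bits per character. It is an
         illustration only and should not be read as the module's actual
         code: the struct and function names are hypothetical, and placing
         the thread index after the counter is an assumption, since the
         ordering given above does not mention it.</p>

         <example><pre><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNIQUE_ID_BYTES 18   /* 144 bits */
#define UNIQUE_ID_CHARS 24   /* 144 bits / 6 bits per character */

/* Hypothetical record holding the five fields listed in the text. */
struct unique_id_fields {
    uint32_t time_stamp;
    uint32_t ip_addr;
    uint32_t pid;
    uint16_t counter;
    uint32_t thread_index;   /* position after the counter is an assumption */
};

static const char uid_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@-";

/* Store the low `nbytes` bytes of value, most significant byte first. */
static size_t put_be(uint8_t *out, uint32_t value, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        out[i] = (uint8_t)(value >> (8 * (nbytes - 1 - i)));
    }
    return nbytes;
}

/* Encode the fields into 24 characters plus a terminating NUL. */
void unique_id_encode(const struct unique_id_fields *f,
                      char out[UNIQUE_ID_CHARS + 1])
{
    uint8_t raw[UNIQUE_ID_BYTES];
    size_t n = 0;
    n += put_be(raw + n, f->time_stamp, 4);
    n += put_be(raw + n, f->ip_addr, 4);
    n += put_be(raw + n, f->pid, 4);
    n += put_be(raw + n, f->counter, 2);
    n += put_be(raw + n, f->thread_index, 4);
    assert(n == UNIQUE_ID_BYTES);

    size_t k = 0;
    for (size_t i = 0; i < UNIQUE_ID_BYTES; i += 3) {
        /* Three bytes become four 6-bit indexes into the alphabet. */
        uint32_t group = ((uint32_t)raw[i] << 16)
                       | ((uint32_t)raw[i + 1] << 8)
                       |  (uint32_t)raw[i + 2];
        out[k++] = uid_alphabet[(group >> 18) & 0x3f];
        out[k++] = uid_alphabet[(group >> 12) & 0x3f];
        out[k++] = uid_alphabet[(group >> 6) & 0x3f];
        out[k++] = uid_alphabet[group & 0x3f];
    }
    out[k] = '\0';
}
```
]]></pre></example>

         <p>Because 144 is a multiple of both 8 and 6, the 18 bytes map onto
         exactly 24 characters and no padding character is needed.</p>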
 177
 178     <p>The ordering was chosen such that it's possible to change
 179     the encoding in the future without worrying about collision
 180     with an existing database of <code>UNIQUE_ID</code>s. The new
 181     encodings should also keep the time stamp as the first element,
 182     and can otherwise use the same alphabet and bit length. Since
 183     the time stamps are essentially an increasing sequence, it's
 184     sufficient to have a <em>flag second</em> in which all machines
 185     in the cluster stop serving any requests and stop using the old
 186     encoding format. Afterwards they can resume requests and begin
 187     issuing the new encodings.</p>
 188
 189     <p>We believe this is a relatively portable solution to this
 190     problem. The identifiers
 191     generated have essentially an infinite life-time because future
 192     identifiers can be made longer as required. Essentially no
 193     communication is required between machines in the cluster (only
 194     NTP synchronization is required, which is low overhead), and no
 195     communication between httpd processes is required (the
 196     communication is implicit in the pid value assigned by the
 197     kernel). In very specific situations the identifier can be
 198     shortened, but more information needs to be assumed (for
 199     example the 32-bit IP address is overkill for any site, but
 200     there is no portable shorter replacement for it).</p>
 201 </section>
 202
 203
 204 </modulesynopsis>