granicus.if.org Git - apache/blob - docs/manual/mod/mod_unique_id.html.en

   1 <html xmlns="http://www.w3.org/TR/xhtml1/strict"><head><!--
   2           XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   3                 This file is generated from xml source: DO NOT EDIT
   4           XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   5         --><title>mod_unique_id- Apache HTTP Server</title><link href="../style/manual.css" type="text/css" rel="stylesheet"/></head><body><blockquote><div align="center"><img src="../images/sub.gif" alt="[APACHE DOCUMENTATION]"/><h3>Apache HTTP Server Version 2.0</h3></div><h1 align="center">Apache Module mod_unique_id</h1><table cellspacing="1" cellpadding="0" bgcolor="#cccccc"><tr><td><table bgcolor="#ffffff"><tr><td valign="top"><span class="help">Description:</span></td><td>Provides an environment variable with a unique
   6 identifier for each request</td></tr><tr><td><a href="module-dict.html#Status" class="help">Status:</a></td><td>Extension</td></tr><tr><td><a href="module-dict.html#ModuleIdentifier" class="help">Module&nbsp;Identifier:</a></td><td>unique_id_module</td></tr></table></td></tr></table><h2>Summary</h2>
   7
   8     <p>This module provides a magic token for each request which is
   9     guaranteed to be unique across "all" requests under very
  10     specific conditions. The unique identifier is even unique
  11     across multiple machines in a properly configured cluster of
  12     machines. The environment variable <code>UNIQUE_ID</code> is
  13     set to the identifier for each request. Unique identifiers are
  14     useful for various reasons which are beyond the scope of this
  15     document.</p>
  16 <h2>Directives</h2><p>This module provides no directives.</p><h2>Theory</h2>
  17
  18
  19     <p>First a brief recap of how the Apache server works on Unix
  20     machines. This feature currently isn't supported on Windows NT.
  21     On Unix machines, Apache creates several children, the children
  22     process requests one at a time. Each child can serve multiple
  23     requests in its lifetime. For the purpose of this discussion,
  24     the children don't share any data with each other. We'll refer
  25     to the children as httpd processes.</p>
  26
  27     <p>Your website has one or more machines under your
  28     administrative control, together we'll call them a cluster of
  29     machines. Each machine can possibly run multiple instances of
  30     Apache. All of these collectively are considered "the
  31     universe", and with certain assumptions we'll show that in this
  32     universe we can generate unique identifiers for each request,
  33     without extensive communication between machines in the
  34     cluster.</p>
  35
  36     <p>The machines in your cluster should satisfy these
  37     requirements. (Even if you have only one machine you should
  38     synchronize its clock with NTP.)</p>
  39
  40     <ul>
  41       <li>The machines' times are synchronized via NTP or other
  42       network time protocol.</li>
  43
  44       <li>The machines' hostnames all differ, such that the module
  45       can do a hostname lookup on the hostname and receive a
  46       different IP address for each machine in the cluster.</li>
  47     </ul>
  48
  49     <p>As far as operating system assumptions go, we assume that
  50     pids (process ids) fit in 32-bits. If the operating system uses
  51     more than 32-bits for a pid, the fix is trivial but must be
  52     performed in the code.</p>
  53
  54     <p>Given those assumptions, at a single point in time we can
  55     identify any httpd process on any machine in the cluster from
  56     all other httpd processes. The machine's IP address and the pid
  57     of the httpd process are sufficient to do this. So in order to
  58     generate unique identifiers for requests we need only
  59     distinguish between different points in time.</p>
  60
  61     <p>To distinguish time we will use a Unix timestamp (seconds
  62     since January 1, 1970 UTC), and a 16-bit counter. The timestamp
  63     has only one second granularity, so the counter is used to
  64     represent up to 65536 values during a single second. The
  65     quadruple <em>( ip_addr, pid, time_stamp, counter )</em> is
  66     sufficient to enumerate 65536 requests per second per httpd
  67     process. There are issues however with pid reuse over time, and
  68     the counter is used to alleviate this issue.</p>
  69
  70     <p>When an httpd child is created, the counter is initialized
  71     with ( current microseconds divided by 10 ) modulo 65536 (this
  72     formula was chosen to eliminate some variance problems with the
  73     low order bits of the microsecond timers on some systems). When
  74     a unique identifier is generated, the time stamp used is the
  75     time the request arrived at the web server. The counter is
  76     incremented every time an identifier is generated (and allowed
  77     to roll over).</p>
  78
  79     <p>The kernel generates a pid for each process as it forks the
  80     process, and pids are allowed to roll over (they're 16-bits on
  81     many Unixes, but newer systems have expanded to 32-bits). So
  82     over time the same pid will be reused. However unless it is
  83     reused within the same second, it does not destroy the
  84     uniqueness of our quadruple. That is, we assume the system does
  85     not spawn 65536 processes in a one second interval (it may even
  86     be 32768 processes on some Unixes, but even this isn't likely
  87     to happen).</p>
  88
  89     <p>Suppose that time repeats itself for some reason. That is,
  90     suppose that the system's clock is screwed up and it revisits a
  91     past time (or it is too far forward, is reset correctly, and
  92     then revisits the future time). In this case we can easily show
  93     that we can get pid and time stamp reuse. The choice of
  94     initializer for the counter is intended to help defeat this.
  95     Note that we really want a random number to initialize the
  96     counter, but there aren't any readily available numbers on most
  97     systems (<em>i.e.</em>, you can't use rand() because you need
  98     to seed the generator, and can't seed it with the time because
  99     time, at least at one second resolution, has repeated itself).
 100     This is not a perfect defense.</p>
 101
 102     <p>How good a defense is it? Suppose that one of your machines
 103     serves at most 500 requests per second (which is a very
 104     reasonable upper bound at this writing, because systems
 105     generally do more than just shovel out static files). To do
 106     that it will require a number of children which depends on how
 107     many concurrent clients you have. But we'll be pessimistic and
 108     suppose that a single child is able to serve 500 requests per
 109     second. There are 1000 possible starting counter values such
 110     that two sequences of 500 requests overlap. So there is a 1.5%
 111     chance that if time (at one second resolution) repeats itself
 112     this child will repeat a counter value, and uniqueness will be
 113     broken. This was a very pessimistic example, and with real
 114     world values it's even less likely to occur. If your system is
 115     such that it's still likely to occur, then perhaps you should
 116     make the counter 32 bits (by editing the code).</p>
 117
 118     <p>You may be concerned about the clock being "set back" during
 119     summer daylight savings. However this isn't an issue because
 120     the times used here are UTC, which "always" go forward. Note
 121     that x86 based Unixes may need proper configuration for this to
 122     be true -- they should be configured to assume that the
 123     motherboard clock is on UTC and compensate appropriately. But
 124     even still, if you're running NTP then your UTC time will be
 125     correct very shortly after reboot.</p>
 126
 127     <p>The <code>UNIQUE_ID</code> environment variable is
 128     constructed by encoding the 112-bit (32-bit IP address, 32 bit
 129     pid, 32 bit time stamp, 16 bit counter) quadruple using the
 130     alphabet <code>[A-Za-z0-9@-]</code> in a manner similar to MIME
 131     base64 encoding, producing 19 characters. The MIME base64
 132     alphabet is actually <code>[A-Za-z0-9+/]</code> however
 133     <code>+</code> and <code>/</code> need to be specially encoded
 134     in URLs, which makes them less desirable. All values are
 135     encoded in network byte ordering so that the encoding is
 136     comparable across architectures of different byte ordering. The
 137     actual ordering of the encoding is: time stamp, IP address,
 138     pid, counter. This ordering has a purpose, but it should be
 139     emphasized that applications should not dissect the encoding.
 140     Applications should treat the entire encoded
 141     <code>UNIQUE_ID</code> as an opaque token, which can be
 142     compared against other <code>UNIQUE_ID</code>s for equality
 143     only.</p>
 144
 145     <p>The ordering was chosen such that it's possible to change
 146     the encoding in the future without worrying about collision
 147     with an existing database of <code>UNIQUE_ID</code>s. The new
 148     encodings should also keep the time stamp as the first element,
 149     and can otherwise use the same alphabet and bit length. Since
 150     the time stamps are essentially an increasing sequence, it's
 151     sufficient to have a <em>flag second</em> in which all machines
 152     in the cluster stop serving and request, and stop using the old
 153     encoding format. Afterwards they can resume requests and begin
 154     issuing the new encodings.</p>
 155
 156     <p>This we believe is a relatively portable solution to this
 157     problem. It can be extended to multithreaded systems like
 158     Windows NT, and can grow with future needs. The identifiers
 159     generated have essentially an infinite life-time because future
 160     identifiers can be made longer as required. Essentially no
 161     communication is required between machines in the cluster (only
 162     NTP synchronization is required, which is low overhead), and no
 163     communication between httpd processes is required (the
 164     communication is implicit in the pid value assigned by the
 165     kernel). In very specific situations the identifier can be
 166     shortened, but more information needs to be assumed (for
 167     example the 32-bit IP address is overkill for any site, but
 168     there is no portable shorter replacement for it). </p>
 169 <hr/></blockquote><h3 align="center">Apache HTTP Server Version 2.0</h3><a href="./"><img src="../images/index.gif" alt="Index"/></a><a href="../"><img src="../images/home.gif" alt="Home"/></a></body></html>