1 <html xmlns="http://www.w3.org/TR/xhtml1/strict"><head><!--
2 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
3 This file is generated from xml source: DO NOT EDIT
4 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
5 --><title>mod_unique_id- Apache HTTP Server</title><link href="../style/manual.css" type="text/css" rel="stylesheet"/></head><body><blockquote><div align="center"><img src="../images/sub.gif" alt="[APACHE DOCUMENTATION]"/><h3>Apache HTTP Server Version 2.0</h3></div><h1 align="center">Apache Module mod_unique_id</h1><table cellspacing="1" cellpadding="0" bgcolor="#cccccc"><tr><td><table bgcolor="#ffffff"><tr><td valign="top"><span class="help">Description:</span></td><td>Provides an environment variable with a unique
6 identifier for each request</td></tr><tr><td><a href="module-dict.html#Status" class="help">Status:</a></td><td>Extension</td></tr><tr><td><a href="module-dict.html#ModuleIdentifier" class="help">Module Identifier:</a></td><td>unique_id_module</td></tr></table></td></tr></table><h2>Summary</h2>
8 <p>This module provides a magic token for each request which is
9 guaranteed to be unique across "all" requests under very
10 specific conditions. The unique identifier is even unique
11 across multiple machines in a properly configured cluster of
12 machines. The environment variable <code>UNIQUE_ID</code> is
13 set to the identifier for each request. Unique identifiers are
14 useful for various reasons which are beyond the scope of this
16 <h2>Directives</h2><p>This module provides no directives.</p><h2>Theory</h2>
19 <p>First a brief recap of how the Apache server works on Unix
20 machines. This feature currently isn't supported on Windows NT.
21 On Unix machines, Apache creates several children, the children
22 process requests one at a time. Each child can serve multiple
23 requests in its lifetime. For the purpose of this discussion,
24 the children don't share any data with each other. We'll refer
25 to the children as httpd processes.</p>
27 <p>Your website has one or more machines under your
28 administrative control, together we'll call them a cluster of
29 machines. Each machine can possibly run multiple instances of
30 Apache. All of these collectively are considered "the
31 universe", and with certain assumptions we'll show that in this
32 universe we can generate unique identifiers for each request,
33 without extensive communication between machines in the
36 <p>The machines in your cluster should satisfy these
37 requirements. (Even if you have only one machine you should
38 synchronize its clock with NTP.)</p>
41 <li>The machines' times are synchronized via NTP or other
42 network time protocol.</li>
44 <li>The machines' hostnames all differ, such that the module
45 can do a hostname lookup on the hostname and receive a
46 different IP address for each machine in the cluster.</li>
49 <p>As far as operating system assumptions go, we assume that
50 pids (process ids) fit in 32-bits. If the operating system uses
51 more than 32-bits for a pid, the fix is trivial but must be
52 performed in the code.</p>
54 <p>Given those assumptions, at a single point in time we can
55 identify any httpd process on any machine in the cluster from
56 all other httpd processes. The machine's IP address and the pid
57 of the httpd process are sufficient to do this. So in order to
58 generate unique identifiers for requests we need only
59 distinguish between different points in time.</p>
61 <p>To distinguish time we will use a Unix timestamp (seconds
62 since January 1, 1970 UTC), and a 16-bit counter. The timestamp
63 has only one second granularity, so the counter is used to
64 represent up to 65536 values during a single second. The
65 quadruple <em>( ip_addr, pid, time_stamp, counter )</em> is
66 sufficient to enumerate 65536 requests per second per httpd
67 process. There are issues however with pid reuse over time, and
68 the counter is used to alleviate this issue.</p>
70 <p>When an httpd child is created, the counter is initialized
71 with ( current microseconds divided by 10 ) modulo 65536 (this
72 formula was chosen to eliminate some variance problems with the
73 low order bits of the microsecond timers on some systems). When
74 a unique identifier is generated, the time stamp used is the
75 time the request arrived at the web server. The counter is
76 incremented every time an identifier is generated (and allowed
79 <p>The kernel generates a pid for each process as it forks the
80 process, and pids are allowed to roll over (they're 16-bits on
81 many Unixes, but newer systems have expanded to 32-bits). So
82 over time the same pid will be reused. However unless it is
83 reused within the same second, it does not destroy the
84 uniqueness of our quadruple. That is, we assume the system does
85 not spawn 65536 processes in a one second interval (it may even
86 be 32768 processes on some Unixes, but even this isn't likely
89 <p>Suppose that time repeats itself for some reason. That is,
90 suppose that the system's clock is screwed up and it revisits a
91 past time (or it is too far forward, is reset correctly, and
92 then revisits the future time). In this case we can easily show
93 that we can get pid and time stamp reuse. The choice of
94 initializer for the counter is intended to help defeat this.
95 Note that we really want a random number to initialize the
96 counter, but there aren't any readily available numbers on most
97 systems (<em>i.e.</em>, you can't use rand() because you need
98 to seed the generator, and can't seed it with the time because
99 time, at least at one second resolution, has repeated itself).
100 This is not a perfect defense.</p>
102 <p>How good a defense is it? Suppose that one of your machines
103 serves at most 500 requests per second (which is a very
104 reasonable upper bound at this writing, because systems
105 generally do more than just shovel out static files). To do
106 that it will require a number of children which depends on how
107 many concurrent clients you have. But we'll be pessimistic and
108 suppose that a single child is able to serve 500 requests per
109 second. There are 1000 possible starting counter values such
110 that two sequences of 500 requests overlap. So there is a 1.5%
111 chance that if time (at one second resolution) repeats itself
112 this child will repeat a counter value, and uniqueness will be
113 broken. This was a very pessimistic example, and with real
114 world values it's even less likely to occur. If your system is
115 such that it's still likely to occur, then perhaps you should
116 make the counter 32 bits (by editing the code).</p>
118 <p>You may be concerned about the clock being "set back" during
119 summer daylight savings. However this isn't an issue because
120 the times used here are UTC, which "always" go forward. Note
121 that x86 based Unixes may need proper configuration for this to
122 be true -- they should be configured to assume that the
123 motherboard clock is on UTC and compensate appropriately. But
124 even still, if you're running NTP then your UTC time will be
125 correct very shortly after reboot.</p>
127 <p>The <code>UNIQUE_ID</code> environment variable is
128 constructed by encoding the 112-bit (32-bit IP address, 32 bit
129 pid, 32 bit time stamp, 16 bit counter) quadruple using the
130 alphabet <code>[A-Za-z0-9@-]</code> in a manner similar to MIME
131 base64 encoding, producing 19 characters. The MIME base64
132 alphabet is actually <code>[A-Za-z0-9+/]</code> however
133 <code>+</code> and <code>/</code> need to be specially encoded
134 in URLs, which makes them less desirable. All values are
135 encoded in network byte ordering so that the encoding is
136 comparable across architectures of different byte ordering. The
137 actual ordering of the encoding is: time stamp, IP address,
138 pid, counter. This ordering has a purpose, but it should be
139 emphasized that applications should not dissect the encoding.
140 Applications should treat the entire encoded
141 <code>UNIQUE_ID</code> as an opaque token, which can be
142 compared against other <code>UNIQUE_ID</code>s for equality
145 <p>The ordering was chosen such that it's possible to change
146 the encoding in the future without worrying about collision
147 with an existing database of <code>UNIQUE_ID</code>s. The new
148 encodings should also keep the time stamp as the first element,
149 and can otherwise use the same alphabet and bit length. Since
150 the time stamps are essentially an increasing sequence, it's
151 sufficient to have a <em>flag second</em> in which all machines
152 in the cluster stop serving and request, and stop using the old
153 encoding format. Afterwards they can resume requests and begin
154 issuing the new encodings.</p>
156 <p>This we believe is a relatively portable solution to this
157 problem. It can be extended to multithreaded systems like
158 Windows NT, and can grow with future needs. The identifiers
159 generated have essentially an infinite life-time because future
160 identifiers can be made longer as required. Essentially no
161 communication is required between machines in the cluster (only
162 NTP synchronization is required, which is low overhead), and no
163 communication between httpd processes is required (the
164 communication is implicit in the pid value assigned by the
165 kernel). In very specific situations the identifier can be
166 shortened, but more information needs to be assumed (for
167 example the 32-bit IP address is overkill for any site, but
168 there is no portable shorter replacement for it). </p>
169 <hr/></blockquote><h3 align="center">Apache HTTP Server Version 2.0</h3><a href="./"><img src="../images/index.gif" alt="Index"/></a><a href="../"><img src="../images/home.gif" alt="Home"/></a></body></html>