<!DOCTYPE modulesynopsis SYSTEM "../style/modulesynopsis.dtd">
<?xml-stylesheet type="text/xsl" href="../style/manual.en.xsl"?>
<!-- $LastChangedRevision$ -->
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<modulesynopsis metafile="mod_unique_id.xml.meta">

<name>mod_unique_id</name>
<description>Provides an environment variable with a unique
identifier for each request</description>
<status>Extension</status>
<sourcefile>mod_unique_id.c</sourcefile>
<identifier>unique_id_module</identifier>
    <p>This module provides a magic token for each request which is
    guaranteed to be unique across "all" requests under very
    specific conditions. The identifier is unique even
    across multiple machines in a properly configured cluster of
    machines. The environment variable <code>UNIQUE_ID</code> is
    set to the identifier for each request. Unique identifiers are
    useful for various reasons which are beyond the scope of this
    document.</p>
    <p>First, a brief recap of how the Apache server works on Unix
    machines. This feature currently isn't supported on Windows NT.
    On Unix machines, Apache creates several children, and the children
    process requests one at a time. Each child can serve multiple
    requests in its lifetime. For the purpose of this discussion,
    the children don't share any data with each other. We'll refer
    to the children as <dfn>httpd processes</dfn>.</p>
    <p>Your website has one or more machines under your
    administrative control; together we'll call them a cluster of
    machines. Each machine can run multiple instances of
    Apache. All of these collectively are considered "the
    universe", and with certain assumptions we'll show that in this
    universe we can generate unique identifiers for each request,
    without extensive communication between machines in the
    cluster.</p>
    <p>The machines in your cluster should satisfy these
    requirements. (Even if you have only one machine you should
    synchronize its clock with NTP.)</p>

    <ul>
      <li>The machines' times are synchronized via NTP or other
      network time protocol.</li>

      <li>The machines' hostnames all differ, such that the module
      can do a hostname lookup on the hostname and receive a
      different IP address for each machine in the cluster.</li>
    </ul>
    <p>As far as operating system assumptions go, we assume that
    pids (process ids) fit in 32 bits. If the operating system uses
    more than 32 bits for a pid, the fix is trivial but must be
    performed in the code.</p>
    <p>Given those assumptions, at a single point in time we can
    distinguish any httpd process on any machine in the cluster from
    all other httpd processes. The machine's IP address and the pid
    of the httpd process are sufficient to do this. An httpd process
    can handle multiple requests simultaneously if you use a
    multi-threaded MPM. To identify threads, we use a thread
    index that Apache httpd uses internally. So in order to
    generate unique identifiers for requests we need only
    distinguish between different points in time.</p>
    <p>To distinguish time we will use a Unix timestamp (seconds
    since January 1, 1970 UTC), and a 16-bit counter. The timestamp
    has only one-second granularity, so the counter is used to
    represent up to 65536 values during a single second. The
    quadruple <em>( ip_addr, pid, time_stamp, counter )</em> is
    sufficient to enumerate 65536 requests per second per httpd
    process. Pid reuse over time does cause problems, however, and
    the counter is also used to alleviate them.</p>
    <p>When an httpd child is created, the counter is initialized
    with ( current microseconds divided by 10 ) modulo 65536 (this
    formula was chosen to eliminate some variance problems with the
    low order bits of the microsecond timers on some systems). When
    a unique identifier is generated, the time stamp used is the
    time the request arrived at the web server. The counter is
    incremented every time an identifier is generated (and allowed
    to roll over).</p>
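    <p>The counter handling described above can be sketched as follows.
    This is only an illustration in Python, not the module's C code. The
    function names are hypothetical, and reading "current microseconds"
    as the sub-second microsecond part of the clock is an assumption.</p>

```python
import time

COUNTER_MODULUS = 65536  # number of values a 16-bit counter can hold


def initial_counter(microseconds):
    """Seed value: (microseconds divided by 10) modulo 65536."""
    return (microseconds // 10) % COUNTER_MODULUS


def next_counter(counter):
    """Advance the counter by one, rolling over to 0 after 65535."""
    return (counter + 1) % COUNTER_MODULUS


def current_microseconds():
    """Microsecond part of the current time (hypothetical helper).

    Assumption: only the sub-second part of the clock is used.
    """
    return (time.time_ns() // 1000) % 1_000_000


# At child start-up: seed the counter once.
counter = initial_counter(current_microseconds())

# For every identifier generated: advance the counter.
counter = next_counter(counter)
```

    <p>Dividing by 10 before reducing modulo 65536 discards the lowest
    decimal digit of the microsecond value, which is the part the text
    describes as unreliable on some systems.</p>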
    <p>The kernel generates a pid for each process as it forks the
    process, and pids are allowed to roll over (they're 16 bits on
    many Unixes, but newer systems have expanded to 32 bits). So
    over time the same pid will be reused. However, unless it is
    reused within the same second, it does not destroy the
    uniqueness of our quadruple. That is, we assume the system does
    not spawn 65536 processes in a one-second interval (it may even
    be 32768 processes on some Unixes, but even this isn't likely).</p>
    <p>Suppose that time repeats itself for some reason. That is,
    suppose that the system's clock is set incorrectly and it revisits a
    past time (or it is too far forward, is reset correctly, and
    then revisits the future time). In this case we can easily show
    that we can get pid and time stamp reuse. The choice of
    initializer for the counter is intended to help defeat this.
    Note that we really want a random number to initialize the
    counter, but no such numbers are readily available on most
    systems (<em>i.e.</em>, you can't use rand() because you need
    to seed the generator, and can't seed it with the time because
    time, at least at one-second resolution, has repeated itself).
    This is not a perfect defense.</p>
    <p>How good a defense is it? Suppose that one of your machines
    serves at most 500 requests per second (which is a very
    reasonable upper bound at this writing, because systems
    generally do more than just shovel out static files). To do
    that it will require a number of children which depends on how
    many concurrent clients you have. But we'll be pessimistic and
    suppose that a single child is able to serve 500 requests per
    second. Of the 65536 possible starting counter values, about
    1000 are such that two sequences of 500 requests overlap. So
    there is roughly a 1.5% chance (1000 / 65536) that, if time (at
    one-second resolution) repeats itself, this child will repeat a
    counter value and uniqueness will be broken. This was a very
    pessimistic example, and with real world values it's even less
    likely to occur. If your system is such that it's still likely
    to occur, then perhaps you should make the counter 32 bits (by
    editing the code).</p>
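    <p>The estimate above can be checked with a short brute-force count.
    The sketch below is illustrative only. It assumes the first run of
    500 counter values starts at 0 (the choice does not matter, since the
    counter wraps around) and counts the starting values at which a
    second run of the same length would share at least one counter value
    with it. The function name is hypothetical.</p>

```python
COUNTER_MODULUS = 65536  # number of values a 16-bit counter can hold


def overlapping_starts(run_length, modulus=COUNTER_MODULUS):
    """Count start values whose run overlaps a run starting at 0.

    Both runs cover `run_length` consecutive counter values and wrap
    around at `modulus`.
    """
    count = 0
    for start in range(modulus):
        # The second run begins inside the first run.
        second_starts_inside_first = start in range(run_length)
        # The second run begins before 0 and wraps around into the first.
        first_starts_inside_second = (modulus - start) in range(1, run_length)
        if second_starts_inside_first or first_starts_inside_second:
            count += 1
    return count


colliding = overlapping_starts(500)
probability = colliding / COUNTER_MODULUS
```

    <p>For a run length of 500 the count is 999, which matches the rough
    figure of 1000 used above and gives a probability of about 1.5%.</p>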
    <p>You may be concerned about the clock being "set back" during
    daylight saving time changes. However, this isn't an issue because
    the times used here are UTC, which "always" goes forward. Note
    that x86 based Unixes may need proper configuration for this to
    be true: they should be configured to assume that the
    motherboard clock is on UTC and compensate appropriately. Even
    so, if you're running NTP then your UTC time will be
    correct very shortly after reboot.</p>
    <!-- FIXME: thread_index is unsigned int, so not always 32bit.-->
    <p>The <code>UNIQUE_ID</code> environment variable is
    constructed by encoding the 144-bit (32-bit IP address, 32-bit
    pid, 32-bit time stamp, 16-bit counter, 32-bit thread index)
    value using the
    alphabet <code>[A-Za-z0-9@-]</code> in a manner similar to MIME
    base64 encoding, producing 24 characters. The MIME base64
    alphabet is actually <code>[A-Za-z0-9+/]</code>; however,
    <code>+</code> and <code>/</code> need to be specially encoded
    in URLs, which makes them less desirable. All values are
    encoded in network byte ordering so that the encoding is
    comparable across architectures of different byte ordering. The
    actual ordering of the encoding is: time stamp, IP address,
    pid, counter. This ordering has a purpose, but it should be
    emphasized that applications should not dissect the encoding.
    Applications should treat the entire encoded
    <code>UNIQUE_ID</code> as an opaque token, which can be
    compared against other <code>UNIQUE_ID</code>s for equality.</p>
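    <p>To make the size arithmetic concrete, here is a minimal sketch of
    such an encoding in Python. It is not the module's implementation.
    The function name is hypothetical, and placing the thread index
    after the counter is an assumption, since the ordering given above
    lists only the first four fields. The sketch packs the five fields
    in network byte order into 18 bytes (144 bits) and encodes them with
    a base64 variant whose last two characters are <code>@</code> and
    <code>-</code>.</p>

```python
import base64
import ipaddress
import struct


def encode_unique_id(time_stamp, ip_addr, pid, counter, thread_index):
    """Pack the fields big-endian and encode them as 24 characters.

    Field order: time stamp, IP address, pid, counter, thread index.
    The position of the thread index is an assumption of this sketch.
    """
    raw = struct.pack(
        ">I4sIHI",                               # 4 + 4 + 4 + 2 + 4 = 18 bytes
        time_stamp,                              # 32-bit Unix time stamp
        ipaddress.IPv4Address(ip_addr).packed,   # 32-bit IPv4 address
        pid,                                     # 32-bit process id
        counter,                                 # 16-bit counter
        thread_index,                            # 32-bit thread index
    )
    # Standard base64 with '+' and '/' replaced by '@' and '-'.
    return base64.b64encode(raw, altchars=b"@-").decode("ascii")


# Example values are made up for illustration.
example_id = encode_unique_id(1700000000, "192.0.2.10", 4321, 7, 0)
```

    <p>Because 18 bytes is a multiple of 3, the result is exactly 24
    characters with no padding. Consumers should still only compare such
    tokens for equality and never decode them.</p>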
    <p>The ordering was chosen such that it's possible to change
    the encoding in the future without worrying about collision
    with an existing database of <code>UNIQUE_ID</code>s. The new
    encodings should also keep the time stamp as the first element,
    and can otherwise use the same alphabet and bit length. Since
    the time stamps are essentially an increasing sequence, it's
    sufficient to have a <em>flag second</em> in which all machines
    in the cluster stop serving any request, and stop using the old
    encoding format. Afterwards they can resume serving requests and
    begin issuing the new encodings.</p>
    <p>We believe this is a relatively portable solution to this
    problem. The identifiers
    generated have essentially an infinite lifetime because future
    identifiers can be made longer as required. Essentially no
    communication is required between machines in the cluster (only
    NTP synchronization is required, which is low overhead), and no
    communication between httpd processes is required (the
    communication is implicit in the pid value assigned by the
    kernel). In very specific situations the identifier can be
    shortened, but more information needs to be assumed (for
    example the 32-bit IP address is overkill for any site, but
    there is no portable shorter replacement for it).</p>