granicus.if.org Git - apache/commitdiff
Removes the 'block evil robots' rule from rewrite_guide, moves it to
access, and makes it not suck.

author    Rich Bowen <rbowen@apache.org>
          Mon, 2 Nov 2009 22:57:44 +0000 (22:57 +0000)
committer Rich Bowen <rbowen@apache.org>
          Mon, 2 Nov 2009 22:57:44 +0000 (22:57 +0000)

git-svn-id: https://svn.apache.org/repos/asf/httpd/httpd/trunk@832175 13f79535-47bb-0310-9956-ffa450edef68

docs/manual/rewrite/access.html.en
docs/manual/rewrite/access.xml
docs/manual/rewrite/rewrite_guide.html.en
docs/manual/rewrite/rewrite_guide.xml

index d04cf740586d1bcf99e96e45fbecd90f1b3b6773..cc0f1743a8502bc0d3dabbec7a4f8cad2b75333d 100644 (file)
@@ -36,7 +36,78 @@ configuration.</p>
 
 </div>
 <div id="quickview"><h3>See also</h3><ul class="seealso"><li><a href="../mod/mod_rewrite.html">Module documentation</a></li><li><a href="intro.html">mod_rewrite introduction</a></li></ul></div>
-</div>
+<div class="top"><a href="#page-header"><img alt="top" src="../images/up.gif" /></a></div>
+<div class="section">
+<h2><a name="blocking-of-robots" id="blocking-of-robots">Blocking of Robots</a></h2>
+
+      
+
+      <dl>
+        <dt>Description:</dt>
+
+        <dd>
+        <p>
+        In this recipe, we discuss how to block persistent requests from
+        a particular robot, or user agent.</p>
+
+        <p>The standard for robot exclusion defines a file,
+        <code>/robots.txt</code>, that specifies those portions of your
+        website from which you wish to exclude robots. However, some robots
+        do not honor this file.
+        </p>
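+
+        <p>For illustration only, a minimal <code>robots.txt</code>
+        entry asking this robot to stay out of that area might look
+        like the following (a sketch, using the robot name and path
+        from the example below):</p>
+
+<div class="example"><pre>
+User-agent: NameOfBadRobot
+Disallow: /secret/files/
+</pre></div>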
+
+        <p>Note that there are methods of accomplishing this which do
+        not use mod_rewrite. Note also that any technique that relies on
+        the client's <code>USER_AGENT</code> string can be circumvented
+        very easily, since that string can be changed.</p>
+        </dd>
+
+        <dt>Solution:</dt>
+
+        <dd>
+        <p>We use a ruleset that specifies the directory to be
+        protected, and the client <code>USER_AGENT</code> that
+        identifies the malicious or persistent robot.</p>
+
+        <p>In this example, we are blocking a robot called
+        <code>NameOfBadRobot</code> from the location
+        <code>/secret/files</code>. You may also specify an IP address
+        range if you wish to block that user agent only when it comes
+        from a particular source.</p>
+
+<div class="example"><pre>
+RewriteCond %{HTTP_USER_AGENT}   ^<strong>NameOfBadRobot</strong>
+RewriteCond %{REMOTE_ADDR}       ^<strong>123\.45\.67\.[8-9]</strong>$
+RewriteRule ^<strong>/secret/files/</strong>   -   [<strong>F</strong>]
+</pre></div>
+        </dd>
+
+      <dt>Discussion:</dt>
+
+      <dd>
+      <p>
+        Rather than using mod_rewrite for this, you can accomplish the
+        same end using alternate means, as illustrated here:
+      </p>
+      <div class="example"><p><code>
+      SetEnvIfNoCase User-Agent ^NameOfBadRobot goaway<br />
+      &lt;Location /secret/files&gt;<br />
+      Order allow,deny<br />
+      Allow from all<br />
+      Deny from env=goaway<br />
+      &lt;/Location&gt;
+      </code></p></div>
+      <p>
+      As noted above, this technique is trivial to circumvent by simply
+      modifying the <code>USER_AGENT</code> request header. If you
+      are experiencing a sustained attack, you should consider blocking
+      it at a higher level, such as at your firewall.
+      </p>
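+
+      <p>As a rough sketch only (assuming a Linux host using
+      <code>iptables</code>, and the address range from the example
+      above), such a firewall-level block might look like:</p>
+      <div class="example"><p><code>
+      iptables -A INPUT -p tcp --dport 80 -s 123.45.67.8 -j DROP<br />
+      iptables -A INPUT -p tcp --dport 80 -s 123.45.67.9 -j DROP
+      </code></p></div>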
+
+      </dd>
+
+      </dl>
+
+    </div></div>
 <div class="bottomlang">
 <p><span>Available Languages: </span><a href="../en/rewrite/access.html" title="English">&nbsp;en&nbsp;</a></p>
 </div><div id="footer">
index 1c9c976b9c9c258b8b21a2a0c3c02d77bd80e118..c45c51e05d2608979d7cdbfd25eb9899c7ec6969 100644 (file)
@@ -43,5 +43,75 @@ configuration.</p>
 <seealso><a href="../mod/mod_rewrite.html">Module documentation</a></seealso>
 <seealso><a href="intro.html">mod_rewrite introduction</a></seealso>
 
+    <section id="blocking-of-robots">
+
+      <title>Blocking of Robots</title>
+
+      <dl>
+        <dt>Description:</dt>
+
+        <dd>
+        <p>
+        In this recipe, we discuss how to block persistent requests from
+        a particular robot, or user agent.</p>
+
+        <p>The standard for robot exclusion defines a file,
+        <code>/robots.txt</code>, that specifies those portions of your
+        website from which you wish to exclude robots. However, some robots
+        do not honor this file.
+        </p>
+
+        <p>Note that there are methods of accomplishing this which do
+        not use mod_rewrite. Note also that any technique that relies on
+        the client's <code>USER_AGENT</code> string can be circumvented
+        very easily, since that string can be changed.</p>
+        </dd>
+
+        <dt>Solution:</dt>
+
+        <dd>
+        <p>We use a ruleset that specifies the directory to be
+        protected, and the client <code>USER_AGENT</code> that
+        identifies the malicious or persistent robot.</p>
+
+        <p>In this example, we are blocking a robot called
+        <code>NameOfBadRobot</code> from the location
+        <code>/secret/files</code>. You may also specify an IP address
+        range if you wish to block that user agent only when it comes
+        from a particular source.</p>
+
+<example><pre>
+RewriteCond %{HTTP_USER_AGENT}   ^<strong>NameOfBadRobot</strong>
+RewriteCond %{REMOTE_ADDR}       ^<strong>123\.45\.67\.[8-9]</strong>$
+RewriteRule ^<strong>/secret/files/</strong>   -   [<strong>F</strong>]
+</pre></example>
+        </dd>
+
+      <dt>Discussion:</dt>
+
+      <dd>
+      <p>
+        Rather than using mod_rewrite for this, you can accomplish the
+        same end using alternate means, as illustrated here:
+      </p>
+      <example>
+      SetEnvIfNoCase User-Agent ^NameOfBadRobot goaway<br />
+      &lt;Location /secret/files&gt;<br />
+      Order allow,deny<br />
+      Allow from all<br />
+      Deny from env=goaway<br />
+      &lt;/Location&gt;
+      </example>
+      <p>
+      As noted above, this technique is trivial to circumvent by simply
+      modifying the <code>USER_AGENT</code> request header. If you
+      are experiencing a sustained attack, you should consider blocking
+      it at a higher level, such as at your firewall.
+      </p>
+
+      </dd>
+
+      </dl>
+
+    </section>
 
 </manualpage> 
index 1463ba8deef89dfbc327051ad5a3e4b9ca3a3a4b..f50aa5c085d11ade56113f692dfc9522f76173af 100644 (file)
@@ -56,7 +56,6 @@
 <li><img alt="" src="../images/down.gif" /> <a href="#old-to-new">From Old to New (intern)</a></li>
 <li><img alt="" src="../images/down.gif" /> <a href="#old-to-new-extern">From Old to New (extern)</a></li>
 <li><img alt="" src="../images/down.gif" /> <a href="#static-to-dynamic">From Static to Dynamic</a></li>
-<li><img alt="" src="../images/down.gif" /> <a href="#blocking-of-robots">Blocking of Robots</a></li>
 <li><img alt="" src="../images/down.gif" /> <a href="#blocked-inline-images">Forbidding Image "Hotlinking"</a></li>
 <li><img alt="" src="../images/down.gif" /> <a href="#proxy-deny">Proxy Deny</a></li>
 <li><img alt="" src="../images/down.gif" /> <a href="#external-rewriting">External Rewriting Engine</a></li>
@@ -653,44 +652,6 @@ RewriteRule    ^foo\.<strong>html</strong>$  foo.<strong>cgi</strong>  [H=<stron
 
     </div><div class="top"><a href="#page-header"><img alt="top" src="../images/up.gif" /></a></div>
 <div class="section">
-<h2><a name="blocking-of-robots" id="blocking-of-robots">Blocking of Robots</a></h2>
-
-      
-
-      <dl>
-        <dt>Description:</dt>
-
-        <dd>
-          <p>How can we block a really annoying robot from
-          retrieving pages of a specific webarea? A
-          <code>/robots.txt</code> file containing entries of the
-          "Robot Exclusion Protocol" is typically not enough to get
-          rid of such a robot.</p>
-        </dd>
-
-        <dt>Solution:</dt>
-
-        <dd>
-          <p>We use a ruleset which forbids the URLs of the webarea
-          <code>/~quux/foo/arc/</code> (perhaps a very deep
-          directory indexed area where the robot traversal would
-          create big server load). We have to make sure that we
-          forbid access only to the particular robot, i.e. just
-          forbidding the host where the robot runs is not enough.
-          This would block users from this host, too. We accomplish
-          this by also matching the User-Agent HTTP header
-          information.</p>
-
-<div class="example"><pre>
-RewriteCond %{HTTP_USER_AGENT}   ^<strong>NameOfBadRobot</strong>.*
-RewriteCond %{REMOTE_ADDR}       ^<strong>123\.45\.67\.[8-9]</strong>$
-RewriteRule ^<strong>/~quux/foo/arc/</strong>.+   -   [<strong>F</strong>]
-</pre></div>
-        </dd>
-      </dl>
-
-    </div><div class="top"><a href="#page-header"><img alt="top" src="../images/up.gif" /></a></div>
-<div class="section">
 <h2><a name="blocked-inline-images" id="blocked-inline-images">Forbidding Image "Hotlinking"</a></h2>
 
       
index 9df0686046fe28c68cb9c36e8fde122e23e9f708..46a40be33aa38261f57379c1c644bca1cd3e3ec1 100644 (file)
@@ -627,44 +627,6 @@ RewriteRule    ^foo\.<strong>html</strong>$  foo.<strong>cgi</strong>  [H=<stron
 
     </section>
 
-    <section id="blocking-of-robots">
-
-      <title>Blocking of Robots</title>
-
-      <dl>
-        <dt>Description:</dt>
-
-        <dd>
-          <p>How can we block a really annoying robot from
-          retrieving pages of a specific webarea? A
-          <code>/robots.txt</code> file containing entries of the
-          "Robot Exclusion Protocol" is typically not enough to get
-          rid of such a robot.</p>
-        </dd>
-
-        <dt>Solution:</dt>
-
-        <dd>
-          <p>We use a ruleset which forbids the URLs of the webarea
-          <code>/~quux/foo/arc/</code> (perhaps a very deep
-          directory indexed area where the robot traversal would
-          create big server load). We have to make sure that we
-          forbid access only to the particular robot, i.e. just
-          forbidding the host where the robot runs is not enough.
-          This would block users from this host, too. We accomplish
-          this by also matching the User-Agent HTTP header
-          information.</p>
-
-<example><pre>
-RewriteCond %{HTTP_USER_AGENT}   ^<strong>NameOfBadRobot</strong>.*
-RewriteCond %{REMOTE_ADDR}       ^<strong>123\.45\.67\.[8-9]</strong>$
-RewriteRule ^<strong>/~quux/foo/arc/</strong>.+   -   [<strong>F</strong>]
-</pre></example>
-        </dd>
-      </dl>
-
-    </section>
-
     <section id="blocked-inline-images">
 
       <title>Forbidding Image &quot;Hotlinking&quot;</title>