From e95fec6cb9034604b486851c24722b14f1b36fc4 Mon Sep 17 00:00:00 2001
From: brian
Date: Mon, 9 Dec 1996 04:49:12 +0000
Subject: [PATCH] Obtained from: Rob Hartill, with some stuff by Brian Behlendorf

Rob modified the section on robot detection and I modified the section on
redirecting an entire server to include a section on mod_rewrite

git-svn-id: https://svn.apache.org/repos/asf/httpd/httpd/trunk@77230 13f79535-47bb-0310-9956-ffa450edef68
---
 docs/manual/misc/howto.html | 190 +++++++++++++++++++-----------------
 1 file changed, 103 insertions(+), 87 deletions(-)

diff --git a/docs/manual/misc/howto.html b/docs/manual/misc/howto.html
index 72c2d09ed5..fba0310c50 100644
--- a/docs/manual/misc/howto.html
+++ b/docs/manual/misc/howto.html
@@ -1,123 +1,139 @@
+
+
Apache HOWTO documentation
-

Apache HOWTO documentation

+

Apache HOWTO documentation

How to: -
-

How to redirect an entire server or directory

+
+

How to redirect an entire server or directory to a single URL

-One way to redirect all requests for an entire server is to setup a -Redirect to a cgi script which outputs a 301 or 302 status -and the location of the other server.

+

There are two chief ways to redirect all requests for an entire
+server to a single location: one which requires the use of
+mod_rewrite, and another which uses a CGI script.

-By using a cgi-script you can intercept various requests and treat them
-specially, e.g. you might want to intercept POST requests, so that the
-client isn't redirected to a script on the other server which expects POST
-information (a redirect will lose the POST information.)

+

First: if all you need to do is migrate a server from one name to
+another, simply use the Redirect directive, as supplied
+by mod_alias:

-Here's how to redirect all requests to a script... In the server configuration
-file,
-

ScriptAlias /
-/usr/local/httpd/cgi-bin/redirect_script
+
+  Redirect / http://www.apache.org/
+
-and here's a simple perl script to redirect +

Since Redirect will forward along the complete path,
+however, it may not be appropriate - for example, when the directory
+structure has changed after the move, and you simply want to direct people
+to the home page.
-
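To make that concrete (the host name below is just an illustration): with

  Redirect / http://www.newserver.com/

a request for /docs/setup.html is answered with a redirect to
http://www.newserver.com/docs/setup.html - exactly what you want if the
file still exists there, and not much help if the tree has been reorganized.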

-#!/usr/local/bin/perl
-
-print "Status: 302 Moved Temporarily\r
-Location: http://www.some.where.else.com/\r\n\r\n";
-
-


+

The best option is to use the standard Apache module mod_rewrite.
+If that module is compiled in, the following lines:
+
+

RewriteEngine On
+RewriteRule /.* http://www.apache.org/ [R]
+

+This will send an HTTP 302 Redirect back to the client, and no matter
+what they gave in the original URL, they'll be sent to
+"http://www.apache.org".
+
+The second option is to set up a ScriptAlias pointing to
+a cgi script which outputs a 301 or 302 status and the location
+of the other server.

+ +

By using a cgi-script you can intercept various requests and
+treat them specially, e.g. you might want to intercept POST
+requests, so that the client isn't redirected to a script on the other
+server which expects POST information (a redirect will lose the POST
+information.) You might also want to use a CGI script if you don't
+want to compile mod_rewrite into your server.
+
+

Here's how to redirect all requests to a script... In the server
+configuration file,
+

ScriptAlias / /usr/local/httpd/cgi-bin/redirect_script
+
+and here's a simple perl script to redirect requests:
+
+
+#!/usr/local/bin/perl
+
+print "Status: 302 Moved Temporarily\r
+Location: http://www.some.where.else.com/\r\n\r\n";
+
+
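If you want to handle the POST case mentioned above, a slightly longer
sketch along the same lines might look like this (the target host name is
again only an example); it checks the standard CGI variable REQUEST_METHOD
and answers POST requests with a short page instead of a redirect:

  #!/usr/local/bin/perl
  # Sketch only: redirect ordinary requests, but give POST requests a
  # short explanation instead, since a redirect would lose the POST data.
  # The host name below is an example - substitute your own.
  if ($ENV{'REQUEST_METHOD'} eq 'POST') {
      print "Content-type: text/html\r\n\r\n";
      print "<H1>This service has moved</H1>\n";
      print "Please resubmit your form at http://www.some.where.else.com/\n";
  } else {
      print "Status: 302 Moved Temporarily\r\n";
      print "Location: http://www.some.where.else.com/\r\n\r\n";
  }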

+ +

How to reset your log files

-Sooner or later, you'll want to reset your log files (access_log and +

Sooner or later, you'll want to reset your log files (access_log and error_log) because they are too big, or full of old information you don't -need.

+need.

-access.log typically grows by 1Mb for each 10,000 requests.

+

access.log typically grows by 1Mb for each 10,000 requests.
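(That works out to roughly 100 bytes per entry, about right for a line in
the common log format; the exact figure will vary with your URL lengths and
log format.)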

-Most people's first attempt at replacingthe logfile is to just move the -logfile or remove the logfile. This doesn't work.

+

Most people's first attempt at replacing the logfile is to just move the +logfile or remove the logfile. This doesn't work.

-Apache will continue writing to the logfile at the same offset as before the -logifile moved. This results in a new logfile being created which is just +

Apache will continue writing to the logfile at the same offset as before the
+logfile was moved. This results in a new logfile being created which is just
as big as the old one, but it now contains thousands (or millions) of null
-characters.

+characters.

-The correct procedure is to move the logfile, then signal Apache to tell it to -reopen the logfiles.

+

The correct procedure is to move the logfile, then signal Apache to tell it to reopen the logfiles.

-Apache is signalled using the SIGHUP (-1) signal. e.g. +

Apache is signaled using the SIGHUP (-1) signal. e.g.

-mv access_log access_log.old ; kill -1 `cat httpd.pid` +mv access_log access_log.old
+kill -1 `cat httpd.pid`
+

-Note: httpd.pid is a file containing the process id +

Note: httpd.pid is a file containing the process id of the Apache httpd daemon, Apache saves this in the same directory as the log -files.

- -Many people use this method to replace (and backup) their logfiles on a -nightly basis.


- -

How to stop robots

- -Ever wondered why so many clients are interested in a file called -robots.txt which you don't have, and never did have?

- -These clients are called robots - special automated clients which -wander around the web looking for interesting resources.

- -Most robots are used to generate some kind of web index which -is then used by a search engine to help locate information.

- -robots.txt provides a means to request that robots limit their -activities at the site, or more often than not, to leave the site alone.

- -When the first robots were developed, they had a bad reputation for -sending hundreds of requests to each site, often resulting in the site -being overloaded. Things have improved dramatically since then, thanks -to Guidlines -for Robot Writers, but even so, some robots may exhibit unfriendly -behaviour which the webmaster isn't willing to tolerate.

- -Another reason some webmasters want to block access to robots, results -from the way in which the information collected by the robots is subsequently -indexed. There are currently no well used systems to annotate documents -such that they can be indexed by wandering robots. Hence, the index -writer will often revert to unsatisfactory algorithms to determine what gets -indexed.

- -Typically, indexes are built around text which appears in -document titles (<TITLE>), or main headings (<H1>), and more -often than not, the words it indexes on are completely irrelevant or -misleading for the docuement subject. The worst index is one based on -every word in the document. This inevitably leads to the search engines -offering poor suggestions which waste both the users and the servers -valuable time

- -So if you decide to exclude robots completely, or just limit the areas -in which they can roam, set up a robots.txt file, and refer -to the robot -exclusion documentation.

- -Much better systems exist to both index your site and publicise its -resources, e.g. -ALIWEB, which -uses site defined index files.

+files.

+ +

Many people use this method to replace (and back up) their logfiles on a
+nightly or weekly basis.
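A minimal nightly rotation script might look something like this (the log
directory, file names and pid-file location are assumptions - adjust them
to your own layout):

  #!/bin/sh
  # Sketch of a nightly log rotation: move the logs aside, then tell
  # Apache to reopen them with SIGHUP.
  cd /usr/local/httpd/logs
  mv access_log access_log.`date +%y%m%d`
  mv error_log error_log.`date +%y%m%d`
  kill -1 `cat httpd.pid`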

+
+ +

How to stop or restrict robots

+ +

Ever wondered why so many clients are interested in a file called +robots.txt which you don't have, and never did have?

+ +

These clients are called robots (also known as crawlers,
+spiders and other cute names) - special automated clients which
+wander around the web looking for interesting resources.

+ +

Most robots are used to generate some kind of web index which +is then used by a search engine to help locate information.

+ +

robots.txt provides a means to request that robots limit their +activities at the site, or more often than not, to leave the site alone.

+ +

When the first robots were developed, they had a bad reputation for sending hundreds/thousands of requests to each site, often resulting in the site being overloaded. Things have improved dramatically since then, thanks to Guidelines for Robot Writers, but even so, some robots may exhibit unfriendly behavior which the webmaster isn't willing to tolerate, and will want to stop.

+ +

Another reason some webmasters want to block access to robots is to
+stop them indexing dynamic information. Many search engines will use the
+data collected from your pages for months to come - not much use if you're
+serving stock quotes, news, weather reports or anything else that will be
+stale by the time people find it in a search engine.

+ +

If you decide to exclude robots completely, or just limit the areas
+in which they can roam, create a robots.txt file; refer
+to the robot information pages provided by Martijn Koster for the syntax.
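As a sketch, a robots.txt at the top level of your server (so that it is
fetched as /robots.txt) might look like the following - the directory and
robot names are only examples; Koster's pages describe the full syntax:

  # Ask all robots to keep out of the script and temporary areas,
  # and ask one particularly unwelcome robot to stay away entirely.
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/

  User-agent: BadRobot
  Disallow: /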

-- 2.40.0