Using Apache as an Internet Proxy Server
Over the past five years, the popularity of the Internet, and its core technologies, has driven nearly every organization that uses information technology toward utilization of the Internet in some fashion. For most, this means connecting the organization's internal network to the Internet, either to provide access to the Internet from within, or to provide access to internal systems, such as Web servers, to the public. Often, it's both.
This presents a challenge to network engineers, systems administrators, and network security specialists. They are now faced with the additional task of building and managing their network and systems infrastructure to accommodate the technologies, the security implications, and the increased demand for network bandwidth and system resources.
An important, often critical, piece of this infrastructure is the proxy server. Proxy servers are simply intermediary servers that accept requests from a client, and direct those requests to another server on behalf of that client. There are basically two types of proxy servers. Circuit-level proxy servers operate at a low level, and have no application-specific knowledge. The circuit-level proxy sets up a "circuit" between the client originating the request and the destination server. The most common circuit-level proxy is SOCKS, a common component of many firewalls. An application-level proxy has "intelligence" about the application protocol being proxied. This allows the application-level proxy to interpret and act on information within the protocol, providing for more sophisticated capabilities for security, application-specific filtering, logging, and potentially reducing bandwidth and latency by caching data.
A Web proxy server is an application-level proxy that services the HTTP protocol and typically ftp as well. The Web proxy server will be an important part of any network that is connected to the Internet. Web proxy servers are also key components of good firewall design and potentially provide great benefit to remote LANs within an organization by deploying multiple proxy servers in these locations. Installing several Web proxy servers need not be an expensive undertaking. By using relatively inexpensive Intel-based PCs, running a freely available operating system such as Linux, and a freely available proxy server such as Apache, you can build a sophisticated network of proxy servers on a slim budget.
Why A Proxy?
I mentioned four key areas where an application-level proxy such as a Web proxy can be of benefit over a circuit-level proxy. By taking advantage of enhanced security capabilities, application filtering, logging, and bandwidth reduction, you can better manage the application, while providing a better level of service to users. How a proxy server can help you depends upon its features in each of these areas.
With a sufficiently feature-rich proxy server, you can authenticate users who attempt to access your proxy, giving you a measure of control over who has access to its resources. Perhaps security policy dictates that although everyone internally should have free access to your Intranet, only approved users with defined business reasons are granted Internet access. One way to accomplish this is by authenticating users at your proxy server.
There are security aspects of a proxy server beyond authentication that are also important. Perhaps you need to block access to specific sites for security or legal reasons. Corporations typically don't want employees browsing pornography sites, sites that contain hacker information, or perhaps even the many job search sites out there. Most proxy servers provide at least rudimentary site-blocking capabilities for just this purpose.
Site blocking features are typically regarded as security features, however they are part of a much richer feature set found in most Web proxy servers: application filtering capabilities. Other filtering capabilities allow a proxy to do site mirroring (or reverse proxying), URL remapping, virus scanning, and filtering based upon specific content such as HTML tags, Java or ActiveX. If your needs in any of these areas are very sophisticated, there are proxy products specifically tailored to things like site blocking or virus scanning. Only you can decide whether your network needs separate servers, and software, for such specialized functions.
Caching capability is a primary reason for deploying Web proxy servers. A caching Web proxy will cache data it retrieves on behalf of the client. The server can then service browser requests of frequently hit data from its cache, rather than having that data retrieved from the actual Web server it resides on. Caching reduces bandwidth consumption on your Internet link, or a WAN link, and decreases latency - delivering Web pages to the end user faster. There is a risk involved with caching, however, as your users may receive stale data in the form of out-of-date Web pages. This risk is mitigated by the way the proxy server uses the response headers in the HTTP/1.0 protocol.
When a Web server delivers an object, be it an HTML file or an image, it can deliver the following directives in the HTTP response header,
The Last-Modified: directive specifies to the requesting client the time and date of the last modification of the object. This timestamp is subsequently used by the proxy server to determine whether a cached object needs to be updated.
The Expires: directive specifies the time and date after which the object is considered expired, or stale.
Finally, Pragma: no-cache specifies to the requesting client that the object should not be cached.
When using a proxy server, the proxy acts on behalf of the client browser, retrieving an object from the requested origin server. The proxy server then stores that object in its cache, using the value of Last-Modified: returned in the response header as a timestamp for that file. Subsequent requests for the object by clients will cause the proxy to perform a conditional GET request. When your proxy makes such a request to an origin server, the Last-Modified: information of the object already in its cache is passed back to the original Web server as part of the request header. In this type of request, a special header directive, If-Modified-Since, is sent to the original server, along with the value of the Last-Modified directive given to the proxy when the object was originally cached.
If the Web server sees that the current version of the object has not been modified since that time, it simply returns HTTP response 304. This indicates to the requestor (your proxy) that the object has not been modified since that time. The proxy will then serve its cached copy directly to the requesting browser. If the object has been modified more recently, it is returned to the proxy, which updates its cache and returns the object to the requesting browser. Note that the conditional GET only makes one connection to the origin server to determine the staleness of the copy in cache and to update it if needed. The origin server either returns HTTP response code 304 or code 200 and the actual data for the requested object.
Any or all of the response header elements can either be sent or left out of the response header. Some Web servers will leave out the Last-Modified: directive, include a Pragma no-cache directive, or manipulate the Expires directive to minimize the time an object spends in proxy caches, or to prevent it from being cached at all. This may be done to ensure visitors to a Web site always get the freshest views of online advertisements, etc. There are configuration capabilities in most proxy servers that allow administrators to specify how objects with no Expire: or Last-Modified: times should be handled. I'll touch briefly on the specifics of how Apache handles this later in this article.
The model described above for cache consistency and freshness is basically an on-command model. When the proxy receives a client request, then and only then is the cached copy of an object checked and perhaps updated in the cache. Some proxy servers also have an on-demand model, which can be used to pre-fetch cached objects during times of low utilization. I consider this an interesting, if not wholly necessary feature, especially if your cache-hit rate is high.
The cache-hit rate is the percentage of requests received by your proxy that can be satisfied without loading, or reloading, the object from its origin server. Luotonen (see References) contends that "cache hit rates with HTTP/1.0 vary anywhere between 20 and 70 percent, usually around 60 percent", and says that as the number of users hitting your proxy goes up, the cache hit rate will increase.
Clearly, the more users, the more likely it is that multiple users will request the same Web page, thus generating cache hits. Consider a network that typically brings 2 GB of HTTP data across its Internet link per day. By adding a caching proxy, and getting a 40% cache hit ratio, you reduce the utilization of your Internet link by as much as 800 MB per day, and deliver that 800 MB of data to the end user more quickly. A 40% cache hit rate won't necessarily equate to a 40% reduction in bandwidth (that would assume all cache hits returned data of the same size), but the example illustrates the potential benefit.
Before I discuss the specific features of the Apache proxy, keep in mind that a Web proxy server is not a firewall. It is an important part of a good firewall design, but not necessarily the only Web proxy server on a network, especially a large network with multiple LANs connected via WAN links. Figure 1 shows a fairly simplistic network proxy architecture, with a firewall proxy, and several remote proxy servers to service users on remote LANs connected via WAN links. I will refer to these as "remote proxies" throughout this article, and where necessary, to the "firewall proxy" as that component of a proxy architecture that resides within the firewall network.
There are a variety of ways to build distributed proxy networks, but exploring all of the options is beyond the scope of this article. The example in Figure 1 is a simple "daisy chain" architecture. Expensive commercial proxy server software usually provides more sophisticated proxy architectures using features such as inter-proxy communication, which allows creation and maintainence of distributed caches. Two such inter-proxy communication protocols, ICP (Internet Cache Protocol), and CARP (Cache Array Routing Protocol) provide different approaches. ICP uses a parent-child relationship among proxy servers. Proxies using ICP will query sibling proxies (children of the same parent), looking for the requested page or file, and if not found, will query the parent. The parent may in turn have siblings of its own to query. CARP on the other hand provides something more like a redundant array of proxies, where every proxy is a mirror of the others. This solution allows for very scalable, highly available proxying.
In any solution, you may in fact end up with proxy servers that only accept requests from other proxies, and only make requests of other proxies - never actually communicating directly with browsers or origin servers. The point is that the best solution depends on your network size and topology. It's up to you to decide how Apache might fit your needs.
How Apache Can Fit Your Proxy Needs
Previously, I listed security, filtering, bandwidth reduction, and logging capabilities as key areas of functionality in a Web proxy server. Although Apache has capabilities in each, some are limited. You must evaluate your own requirements in each area before choosing a proxy server.
As I stated before, a proxy server is not, by itself, a firewall. Proxy servers usually are an important part of a firewall, but their security features, among other things, are not sufficient to make a firewall out of a proxy server alone. However, a proxy server's usefulness extends well beyond its typically limited security features. These security capabilities typically center around access control - controlling who is allowed to use the proxy, as well as the ability to work with the firewall to provide access to the Internet. The Apache proxy has some typical access control measures. These are actually a standard part of the core Apache Web server features and HTTP/1.0 security, rather than being specific to the proxy module. You can control access to the proxy by configuring it to allow or deny access based upon the IP address of the client making the request. This is of limited value, in my opinion, as more and more organizations move to the use of DHCP, resulting in individual workstations having different IP addresses from time to time.
You're not likely to encounter problems making Apache work well with your firewall, whether the firewall uses circuit-level proxying such as SOCKS or has an HTTP proxy of its own. If your firewall has its own HTTP proxy, you may simply need to configure the ProxyRemote directive of Apache (details below) to pass your remote proxy requests on to the firewall's proxy. This is an example of proxy "daisy chaining". If your firewall uses SOCKS, you'll need to do some extra work when you compile and install the Apache proxy server. SOCKS is a circuit-level proxy, rather than an application-specific proxy like Apache, allowing it to proxy multiple, perhaps obscure, application protocols. The Apache Server Project Web site mentions that the proxy module is compatible with SOCKS4 proxy servers. If you are in this situation, consult the online documentation and the INSTALL and README files for further details.
Apache also has filtering capabilities. The most common filtering performed by a Web proxy is site blocking, which is accomplished by matching URLs requested by clients against a list of blocking rules. This list is simply a list of words, potentially containing wildcards, that might be found in a proxied URL. If a word from the list is found in the URL, access is not permitted. (See the example for ProxyBlock given below.)
The Apache proxy also has directives that provide for site mirroring (making a local server appear to be a mirror of a remote one), and reverse proxying (providing Internet users access to a Web server behind your firewall). Reverse proxying is essentially the reverse of what a proxy is normally used for. Luotonen covers reverse proxying and mirroring in better detail than I can here.
I sought an inexpensive caching proxy. Apache's caching capabilities are standard fare, and I found them to be complete enough for my purposes - minimizing the amount of data that actually had to be retrieved across a relatively slow Internet link. Apache has several configurations that control such things as cache size, frequency of cache clean up (garbage collection), and more. If caching is your most important concern, Apache gives you about as much control as you can get with any HTTP/1.0 protocol Web proxy. The HTTP/1.1 protocol has considerably more in its specification related to proxying, including caching. I look forward to the Apache Server Project updating the proxy module to HTTP/1.1.
What Can't Apache Do?
Apache doesn't seem to have a useful means by which individual user authentication takes place. Although the HTTP/1.0 protocol has some limited user authentication features, I could find no specific information about how to use these in the context of the Apache proxy server. I would like to see some capability to integrate with a directory service such as NIS, NDS, or LDAP. If sophisticated user authentication features or integration with an enterprise wide directory service are required, Apache is probably not the best choice.
The Apache proxy is somewhat limited in cache control capabilities, but only in the sense that the HTTP/1.0 protocol provides relatively unsophisticated proxy caching features. As an HTTP/1.0 proxy server, Apache is limited to working with the Last-Modified, Expires, and Pragma: no-cache response header directives. These, however, should be enough for most caching needs.
Inter-cache communication, such as provided by ICP or CARP, is non-existent in the Apache proxy. This is a fairly complex area, and I did not expect such a level of sophistication from free software. This does not mean you can't use Apache to create a useful architecture of distributed proxy servers. Simple "daisy chaining" of proxy servers can be used to create an environment where caching proxies are nearest to the end users.
Apache is also strictly an on-demand caching proxy. A document is retrieved and cached when a request from a client is received. That document is refreshed only when another demand is received, and it is determined that the document must be refreshed from its origin server before it is returned. There are no facilities to perform pre-fetching or on-command caching.
Configuring and Running an Apache Proxy Server
First, download an Apache source distribution that is appropriate for your system from The Apache Server Project Web site:
The Apache Server Project makes binary distributions available for a large number of UNIX systems. Proxying capabilities are accomplished through an Apache module. Several modules are available for Apache. Refer to the Web site above for descriptions. Since Apache modules are built into the server at compile time, you'll have to get the source distribution to compile in the proxy module.
The most recent version of Apache as of this writing is Apache 1.3.2. Follow the download link on the Web site above. From there, select the file you wish to download. Choose the gzipped version (with the .gz extension), if you have access to gzip/gunzip on your system, to save some download time.
Once you have downloaded the source distribution, uncompress it. For example, use:
% gunzip apache_1.3.2.tar.gz
for the gzipped version, or
% uncompress apache_1.3.2.tar.
if you downloaded the normal compressed version. Extract the contents of the resulting tar archive:
% cd /usr/local/src
% tar -xvf /tmp/apache_1.3.2.tar
Change the paths accordingly for your installation.
The Apache Server Project Web site includes documentation for compiling and installing the distribution. I found this to be somewhat incomplete, and in one instance wrong, so I'll go through a simple scenario. For excruciating detail, see the various README and INSTALL files at the root of the source distribution you just created. I'm compiling on a Linux system (Slackware 3.5). Your mileage may vary.
Change your current directory to the src directory of the Apache distribution root. In my case, that's:
% cd /usr/local/src/apache_1.3.2/src
List the directory, and you will see a file named Configuration.tmpl. Make a copy of this file before you continue. The Configuration.tmpl file contains lines that control which modules will be compiled into your Apache server. A default set of modules is already uncommented, and all I needed to do was uncomment (by removing the leading # character) the line reading:
Next, change directory back to the distribution root and run the configure script provided:
% ./configure --prefix=/usr/local/prox-apache
The --prefix option will cause the Makefile that is generated to install Apache in the specified directory and configure the Apache binaries to look in that directory for runtime configuration files, log files, etc. I maintain two Apache directories in /usr/local, one with the proxy module compiled as above, and another without - that runs my origin Web server. That way, each has separate configuration files, log files, etc.
After running the configure script, compile and install the server using make. In the same directory, run:
Assuming everything compiles successfully, run:
% make install
Check the directory specified on the configure command line above. You should see something similar to:
% ls /usr/local/prox-apache
bin etc include libexec man sbin share var
You can now start and stop your Apache proxy using the commands:
% /usr/local/prox-apache/sbin/apachectl start
% /usr/local/prox-apache/sbin/apachectl stop
Before your Apache proxy will be very useful to you, though, you'll want to edit the runtime configuration files and set up a few of the proxy module directives. Every Apache module has a list of configuration directives, along with the basic server configuration directives. For a simple Web server, these would control things such as the TCP port your server listens on, the Web site document root, CGI bin directory, and other virtual directories. For the proxy module, there are directives for turning on and off proxying, controlling the cache, blocking access to sites, etc. I will cover many of these directives in detail later. Let's get the basics done for now.
Change to the newly installed Apache server configuration directory:
% cd /usr/local/prox-apache/etc
Make a copy of the httpd.conf file, so you will have a baseline configuration to fall back on. Next, using your favorite editor, edit httpd.conf. The first thing to consider is what port you want your proxy to service requests on. By default the proxy will be configured to use port 80, the default HTTP server port. If your machine is also running a content Web server, you'll have to configure your proxy to use a different port. You should choose one above 1000. A common choice is 8080. So, change the line that reads:
You can now simply uncomment several directives related specifically to the proxy server. Find and uncomment lines that have the following directives:
Once you've saved your changes, you're almost ready to fire up your proxy server. First, however, you'll need to check and perhaps change the permissions on the directory given in the CacheRoot directive above. On my system, this directory was owned by root with 755 permissions (rwxr-xr-x). Since any good (and secure) Web server, including this proxy, does not run as root (by default on my Linux system, it runs as nobody), your proxy won't have permission to cache documents unless you change permissions on this directory. It will still proxy requests, it just won't cache what it retrieves. When the permissions are set, you're ready. Start up your proxy:
% /usr/local/prox-apache/sbin/apachectl start
Check that it is running using the ps command:
% ps -wax | grep httpd
84 ? S /usr/local/prox-apache/sbin/httpd
85 ? S /usr/local/prox-apache/sbin/httpd
86 ? S /usr/local/prox-apache/sbin/httpd
87 ? S /usr/local/prox-apache/sbin/httpd
88 ? S /usr/local/prox-apache/sbin/httpd
89 ? S /usr/local/prox-apache/sbin/httpd
Now, configure a browser on a workstation on your network to use your server as a proxy and try visiting a few sites. Your success at this point is entirely dependent upon your network architecture, your firewall, and how you are connected to the Internet. If this proxy is to be your firewall proxy, as opposed to a remote proxy somewhere on your LAN, everything should function properly (assuming the system on which you installed the Apache proxy is properly configured for communication on your Internet link). If you are installing a remote proxy, check with your network or firewall administrator for information about your firewall. You'll need to determine whether your firewall has its own Web proxy, or whether it uses a SOCKS circuit-level proxy. If it uses an HTTP proxy, you'll need the hostname of the proxy server system and the port number it uses for proxying HTTP. You'll need this information when you configure your Apache proxy, using the ProxyRemote directive. I'll discuss this directive later.
There are 20 different directives used specifically by the Apache proxy module. So far, I've mentioned only seven. I won't go into every directive in detail here. See:
for descriptions. I will cover the ones we've seen so far, and a few others that may be of interest.
The first one mentioned above, ProxyRequests, is a simple binary directive, taking either On or Off as a value. If set to Off, your server will not proxy requests. To turn on proxying, set this to On by including the line:
in your httpd.conf file. The remainder of the directives that I uncommented during initial configuration control various caching features of the Apache proxy. CacheRoot specifies the local directory where cached results are stored. CacheSize specifies the size, in Kbytes, of your cache. This is a loose specification, as the cache may grow beyond this value. Your proxy server can routinely perform garbage collection on the cache. During garbage collection, the proxy will delete files until the disk space consumed by the cache is at or below the value specified for CacheSize. Garbage collection itself is controlled by the CacheGCInterval directive, to which you provide a value, in hours (which can be a floating point number), specifying how often the server checks the size of the cache and performs garbage collection as necessary. Adjust your CacheSize and CacheGCInterval accordingly, taking into account how much disk you have available and the fact that the cache will grow above CacheSize between garbage collection runs. If you don't set CacheGCInterval, your cache will be bounded only by available disk space.
The CacheMaxExpire directive specifies the maximum amount of time objects will be retained in the cache. After this period, requests for the object will automatically cause the proxy to update its cache from the object's origin server.
The CacheLastModifiedFactor directive is used as a "fudge factor" for documents that don't specify their own expiration date. The proxy will look at the last modified date, apply the specified factor, and use that as the expiration date. Some protocols, such as ftp, do not provide for the specification of an expiration date on documents served via that protocol. In such cases, you can use the CacheDefaultExpire directive to specify a period, in hours, after which cached documents retrieved via that protocol will be expired in your cache.
That's it for the basic configuration needed to get up and running. Feel free to tinker with the values used above and see the results. Every time you edit httpd.conf, you'll need to restart your proxy server, using:
% /usr/local/prox-apache/sbin/apachectl restart
to refresh the configuration information.
There are a few other directives I'll discuss briefly, as they may be important depending upon your circumstances. I'll leave the configuration and use of these up to you.
The ProxyRemote directive may be important if you are using Apache as a remote proxy. This directive, which can have multiple instances in httpd.conf, is used to control the behavior when your proxy must in turn make requests of other proxies, such as your firewall proxy. You may also have dedicated application proxy servers in your firewall network for several application protocols, such as ftp or gopher. For example, I may have a dedicated ftp proxy in my firewall that must be used for all ftp access to the Internet. If you have set up Apache as a remote proxy for your building and are using Apache to proxy both ftp and HTTP requests, you may wish to configure Apache with something like the following:
ProxyRemote ftp http://prox-ftp.somedomain.com:8080
ProxyRemote http http://prox-http.somedomain.com:8080
This will ensure that your Apache proxy communicates with the right firewall proxy servers for each protocol before the request goes out to the Internet.
It may also be the case that all of your proxy requests should simply go to another proxy. That proxy may be your firewall, or it could be another proxy to which multiple remote proxies must make requests, as with the "Regional Remote Proxy" in Figure 1. You may then need to use something like:
ProxyRemote * http://proxy.northwest.somedomain.com
in all remote proxies in the northwest region. This is a contrived example, but the point is that there are numerous ways that a larger organization can build a multi-proxy network, and the ProxyRemote is how you tie your Apache proxy servers together.
Another important directive, AllowCONNECT, is used to specify the TCP port numbers used by the CONNECT request. The CONNECT request is used by browsers when there is a connection request to a secure HTTP server, as indicated by https in a URL. The default, and minimum, configuration, if you intend to proxy requests to secure servers, is:
This allows proxying of secure connections to the default https server port - 443. Without this, users of your proxy service will not be able to connect to secure sites, which are becoming more common as use of the Internet for conducting electronic commerce increases.
The last directive I'll cover is ProxyBlock. The values given to this directive are a list of words, hosts, or domains, allowing for wildcarding, to which your proxy server will block access. I won't presume to guess what you may need to block, but here's a simple contrived example:
ProxyBlock *xxx* reallybadstuff.com
This will block access to anything with xxx in the URL, as well as anything in the reallybadstuff.com domain. Because URL's themselves are not case-sensitive, the case of the words given as arguments to ProxyRemote is unimportant. Telling Apache to block access to ReallyBADstuff.com is the same as blocking access to reallybadstuff.com.
Note that if an attempt to access an otherwise blocked site is made using an IP address rather than a domain name, Apache will still be able to make a match against the list in the ProxyBlock directive. A reverse DNS lookup of the IP address is performed to determine the domain name, then the URL is matched against the list. To block access to all sites, which effectively turns off proxying, use:
Once you have your proxy suitably configured, add a startup script in the appropriate place, such as /etc/rc.d, or in the /etc/rc.local file, so that your proxy will start at boot time. Then, begin configuring your users' browsers to use the new caching proxy, and they'll be amazed at the newfound speed you've squeezed out of your limited bandwidth Internet link.
I've only touched the surface of many of the important capabilities of a good Web proxy server. For a brief list of commercial proxy servers, refer to the sidebar "Proxy Servers". For more detailed coverage, see the References at the end of the article. Ari Luotonen's book deserves particular recognition for raising my level of knowledge on proxying. I highly recommend it as a resource regardless of what proxy server you choose to implement.
Ar i Luotonen, Web Proxy Servers. Prentice Hall, 1998.
D. Brent Chapman and Elizabeth Zwicky, Building Internet Firewalls. O'Reilly & Associates, 1995.
T. Berners-Lee, R. Fielding, and H. Frystyk, RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, 1996.
About the Author
Chris Bush is a Programmer Analyst for a major mid-west bank in Cleveland, OH. He does Web site management, Web development, Internet/Intranet architecture, UNIX systems administration, and dabbles with Linux at home in his spare time. He can be reached at email@example.com.