Cover V11, I02

Article

feb2002.tar

Managing Logging with Numerous Virtual Hosts in Apache

Jay Ribak

A As a systems administrator for a Web hosting company that does a lot of shared virtual hosting (using Apache's virtual host capabilities), I have to deal with log files for 500 or more virtual hosts per Web server. Customers require their own log files per domain, which can then be fed to various log analysis programs to generate site statistics. This article describes our homegrown solution to dealing with the technical and administrative issues that arise when you have hundreds or thousands of log files, which can amount to several gigabytes of space per month.

An early problem we encountered with the 2.0 series of the Linux kernel was dealing with large numbers of file descriptors. Apache opens one file descriptor per open log file. As the number of virtual hosts increases, each with its own access and error logs, the number of available open file descriptors decreases. We experienced some unusual problems when we hit the threshold -- specifically, Apache refused to spawn external interpreters to handle CGI scripts.

The 2.0 kernels had static limits built into the kernel regarding the number of open file descriptors each process was allowed to access. At the time, the only solution was an unsupported kernel patch and a recompile of the kernel. We chose not to use unsupported kernel patches, so we needed a different solution. Of course, as newer kernels were released, this was no longer an issue because file descriptor modifications could be made through the /proc filesystem.

The second issue with the log files was both a security and storage issue -- where would all of these logs reside? Many hosts place them in the user's home directory. Since Apache must start as root in order to bind to a privileged port, all of the log files would be owned by root. Having a root-owned file in a user's home directory was a security risk we did not want.

The amount of space required to store individual logs for hundreds of domains turned out to be greater than we had anticipated. Of course, the size of the logs is directly related to the number of hits each domain receives, and there will probably be a few large hits in any large group of domains. Our solution allows us to place all of our logs on a dedicated device where tracking and managing space is easier. A 5-GB partition suffices to hold the logs for a single month for most of our Web servers. In some situations, an entire 9-GB disk is dedicated to log storage.

The Solution (Overview)

We arrived at fairly simple solution that didn't involve too much re-engineering of current code and administrative processes. Apache continues to log to the traditional access_log, which is then later split into logs for each respective virtual host once per day. This solution already solves the file descriptor problem because each virtual host does not need individual log files. Since each virtual host block in httpd.conf inherits its logging from the parent server and does not need individual TransferLog and ErrorLog statements, a benefit of this solution is a significantly smaller httpd.conf file.

With Apache logging to a single access_log file, how are the logs split into their respective parts? The Apache distribution ships with a useful Perl script, called split-logfile. Split-logfile can be found in the src/support directory of the Apache source tree. When presented with an access_log of a particular format, split-logfile splits the log apart and creates (or appends) individual logs for each virtual host. When the virtual host log already exists, split-logfile simply appends to the end of it, allowing continuity over a period of time in the individual virtual host logs.

Once split, the individual log files can be manipulated in the same manner as a standard access_log. In fact, they are identical to standard logs. As such, they can be run through log analyzers or other programs. The logs have the added benefit of not being open file descriptors, so they can be edited, moved, or deleted without having to signal the httpd daemon. The only drawback to this solution is that the individual virtual host logs are not updated in real time, but only as often as the main log is split. (In our case, this is once per day, at midnight.)

For the split-logfile script to recognize which entries belong to which virtual hosts, the format of the main access_log must be slightly modified to add an extra field to the beginning of the log entry. Because this extra field could confuse log analysis programs, split-logfile removes this field as it is splitting the logs, thereby returning the logs back to a standard format.

The Solution (Specifics)

You must have a basic understanding of Apache's mod_log_config module to understand the specifics of our modifications. The mod_log_config module allows the administrator to have granular control over how the daemon handles logging. Using the LogFormat directive, the administrator can specify the exact format of log entries. A few default examples are provided in the standard httpd.conf when Apache is installed. One of the examples provides a combined referrer and agent log. This format is popular with many Webmasters because it provides the most information about who is visiting a site, their origins, and their platforms. The default format for the combined log is:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
The LogFormat statement is used to specify the format of the logs. Specification of the location of the log file and the specific format to use falls to the CustomLog statement. The CustomLog directive takes two arguments -- the first is the full path of the log file, and the second is the name given to the LogFormat to be used (in the case above, this is combined). To complete the above example, a sample CustomLog directive would be:

CustomLog /usr/local/apache/var/log/access_log combined
With this basic understanding of how to customize log file formats, we can get into the specific changes needed for split-logfile. As mentioned, the script requires an extra field at the beginning of each log entry, specifying the virtual host to which the log entry belongs. This is easily done by adding the token %v to the beginning of the LogFormat directive. The newly modified combined log with the extra virtual host field appears as follows:

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Because most log analysis programs do not recognize the format of the log files with the additional virtual host field at the beginning, split-logfile strips out this field as it processes the logs, leaving a standard combined format log file.

Once the proper log configuration was completed, we devised a directory structure to use for the new logging methods. This is the place where local admins can customize things to their own site policies and practices. Our local practice is to use the GNU-style layout for Apache; thus, Apache lives in /usr/local/apache, and the main logs are in /usr/local/apache/var/log. Within this directory, we created two additional directories -- old-logs and cust-logs (short for customer logs). The old-logs directory is used to store the (compressed) large, pre-split access_log. Local retention policies and tape archiving policies will determine how long to keep these logs. Our policy is to keep only the current month's logs here, and older pre-split log files can be restored from tape if necessary.

We use the cust-logs directory to store the actual virtual host log files in their post-split state. This directory is a mount point for a large partition that holds the individual logs. Split-logfile creates logs based on the ServerName directive of the virtual host with .log appended to the end. For example, if there were a virtual host named domain.com, the log file produced would be called domain.com.log. An idiosyncracy of the split-logfile script is that it deposits its output (the split log files for each virtual host) in the directory in which it resides, thus making it necessary to store the split-logfile script in the cust-logs directory.

While the split-logfile script is extremely useful, it doesn't achieve its true potential without an external wrapper program that handles all aspects of log file rotation and archival, which is where the logcron script comes into play. Logcron (Listing 1) is a fairly simple Korn shell script that was developed at my company. One of the great things about logcron is that is easy to modify to handle additional tasks if necessary.

The logcron Script

The script is intended to be run through the cron facility on a daily basis, or at an interval desired by the local administrator. Please note that if the script is to be run more than once daily, some minor modifications are required because the script writes files using the year, month, and day for identification. A simple change can be made to also utilize the time of day so that filenames are unique. Change the line:

DATE='/bin/date +%y%m%d'
to another format, which adds some other unique identifying characteristic, such as the hour and minute. This new format (or any other format the administrator prefers) can appear as:

DATE='/bin/date +%y%m%d%H%M'
The script is heavily commented and should be easy to follow. The basic flow of logcron is:

1. The access_log and error_log files are copied to the archive directory (old-logs) and renamed with the date appended to the file to make the filename unique. Note that at this time, Apache is still writing to these files.

2. New, blank logfiles are created in the standard location.

3. The httpd daemon is restarted with the USR1 signal. This signal tells the daemon to gracefully close the old file descriptors and begin writing to the new ones. When dealing with log rotation, SIGUSR1 is preferred over SIGHUP.

4. Split-logfile is run on the archived access_log. The split logs are deposited in the cust-logs directory by the split-logfile script.

5. The original archived logs are gzipped for more efficient storage.

At our site, we run logcron once daily at 11:55 PM because it can take a few minutes to copy the monolithic log file to the archive directory and restart Apache. Thorough instructions for running the script from cron are provided in the comments of the script.

The logcron script is written to allow unlimited local customization and expansion, such as using the script to spawn log analysis programs or bandwidth monitoring tools. If you do not perform reverse DNS lookups during the course of normal logging (the HostnameLookups directive is off in httpd.conf), the lookups can be performed through logcron before the monolithic log file is passed to split-logfile. In fact, this is a change that we made to our procedures later in order to maximize performance. It is also possible to use logcron to perform other specific log maintenance, such as removing old logs on a monthly basis or to do general Apache maintenance on a specific timeline.

Through the use of a Perl script included with Apache, a simple custom-written Korn shell script, and a few simple modifications to Apache's configuration directives, we have streamlined the task of virtual host logging at our site. We have not only achieved respite from the problem of running out of file descriptors (which has the potential to bring down or cripple hundreds or thousands of sites), but we have also turned logging into a centralized, organized procedure. Rather than having logs scattered throughout the file system, they are now all maintained on a dedicated disk device where analysis and maintenance is simplified. The project helped us achieve our goal to have a more robust and easily customized environment.

Jay Ribak has been a systems administrator working with various flavors of UNIX (with an emphasis on Linux, Solaris, and FreeBSD) for the past seven years. He is a founding partner and lead network engineer of a small Web hosting company, Web Serve Pro. Jay can be reached via email at: jay@webservepro.com.