Managing Logging with Numerous Virtual Hosts in Apache
Jay Ribak
As a systems administrator for
a Web hosting company that does a lot of shared virtual hosting
(using Apache's virtual host capabilities), I have to deal
with log files for 500 or more virtual hosts per Web server. Customers
require their own log files per domain, which can then be fed to
various log analysis programs to generate site statistics. This
article describes our homegrown solution to dealing with the technical
and administrative issues that arise when you have hundreds or thousands
of log files, which can amount to several gigabytes of space per
month.
An early problem we encountered with the 2.0 series of the Linux
kernel was dealing with large numbers of file descriptors. Apache
opens one file descriptor per open log file. As the number of virtual
hosts increases, each with its own access and error logs, the number
of available open file descriptors decreases. We experienced some
unusual problems when we hit the threshold -- specifically,
Apache refused to spawn external interpreters to handle CGI scripts.
The 2.0 kernels had static limits built into the kernel regarding
the number of open file descriptors each process was allowed to
access. At the time, the only solution was an unsupported kernel
patch and a recompile of the kernel. We chose not to use unsupported
kernel patches, so we needed a different solution. Of course, as
newer kernels were released, this was no longer an issue because
file descriptor modifications could be made through the /proc
filesystem.
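As a rough illustration of the arithmetic, the shell sketch below estimates how many descriptors per-host logging consumes and prints the limits to compare against. The helper name fds_needed and the figure of 500 hosts are illustrative assumptions, and /proc/sys/fs/file-max is only present on kernels that expose it.

```shell
#!/bin/sh
# Hypothetical helper: one access log plus one error log per virtual host.
fds_needed() {
    echo $(( $1 * 2 ))
}

echo "descriptors needed for logs: $(fds_needed 500)"
echo "per-process soft limit:      $(ulimit -n)"

# System-wide ceiling, where the kernel exposes it through /proc.
if [ -r /proc/sys/fs/file-max ]; then
    echo "system-wide maximum:         $(cat /proc/sys/fs/file-max)"
fi
```

The estimate counts log files only; Apache also needs descriptors for sockets, CGI pipes, and the files it serves, so the real headroom is smaller.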
The second issue with the log files was both a security and storage
issue -- where would all of these logs reside? Many hosts place
them in the user's home directory. Since Apache must start
as root in order to bind to a privileged port, all of the log files
would be owned by root. Having a root-owned file in a user's
home directory was a security risk we did not want.
The amount of space required to store individual logs for hundreds
of domains turned out to be greater than we had anticipated. Of
course, the size of the logs is directly related to the number of
hits each domain receives, and any large group of domains will probably
include a few high-traffic sites. Our solution allows us to place
all of our logs on a dedicated device where tracking and managing
space is easier. A 5-GB partition suffices to hold the logs for
a single month for most of our Web servers. In some situations,
an entire 9-GB disk is dedicated to log storage.
The Solution (Overview)
We arrived at a fairly simple solution that didn't involve
too much re-engineering of current code and administrative processes.
Apache continues to log to the traditional access_log, which is
split once per day into a separate log for each virtual host.
This alone solves the file descriptor problem
because each virtual host no longer needs its own open log files. Since
each virtual host block in httpd.conf inherits its logging from
the parent server and does not need individual TransferLog and ErrorLog
statements, a benefit of this solution is a significantly smaller
httpd.conf file.
With Apache logging to a single access_log file, how are the logs
split into their respective parts? The Apache distribution ships
with a useful Perl script, called split-logfile. Split-logfile can
be found in the src/support directory of the Apache source
tree. When presented with an access_log of a particular format,
split-logfile splits the log apart and creates (or appends) individual
logs for each virtual host. When the virtual host log already exists,
split-logfile simply appends to the end of it, allowing continuity
over a period of time in the individual virtual host logs.
Once split, the individual log files can be manipulated in the
same manner as a standard access_log. In fact, they are identical
to standard logs. As such, they can be run through log analyzers
or other programs. The logs have the added benefit of not being
open file descriptors, so they can be edited, moved, or deleted
without having to signal the httpd daemon. The only drawback to
this solution is that the individual virtual host logs are not updated
in real time, but only as often as the main log is split. (In our
case, this is once per day, at midnight.)
For the split-logfile script to recognize which entries belong
to which virtual hosts, the format of the main access_log must be
slightly modified to add an extra field to the beginning of the
log entry. Because this extra field could confuse log analysis programs,
split-logfile removes this field as it is splitting the logs, thereby
returning the logs back to a standard format.
The Solution (Specifics)
You must have a basic understanding of Apache's mod_log_config
module to understand the specifics of our modifications. The mod_log_config
module allows the administrator to have granular control over how
the daemon handles logging. Using the LogFormat directive, the administrator
can specify the exact format of log entries. A few default examples
are provided in the standard httpd.conf when Apache is installed.
One of the examples provides a combined referrer and agent log.
This format is popular with many Webmasters because it provides
the most information about who is visiting a site, their origins,
and their platforms. The default format for the combined log is:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
The LogFormat directive defines the format of the log entries; the
CustomLog directive specifies where the log file lives and which
format it uses. CustomLog takes
two arguments -- the first is the full path of the log file, and
the second is the name given to the LogFormat to be used (in the case
above, this is combined). To complete the above example, a
sample CustomLog directive would be:
CustomLog /usr/local/apache/var/log/access_log combined
With this basic understanding of how to customize log file formats,
we can get into the specific changes needed for split-logfile. As
mentioned, the script requires an extra field at the beginning of
each log entry, specifying the virtual host to which the log entry
belongs. This is easily done by adding the token %v to the
beginning of the LogFormat directive. The newly modified combined
log with the extra virtual host field appears as follows:
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Because most log analysis programs do not recognize the format of
the log files with the additional virtual host field at the beginning,
split-logfile strips out this field as it processes the logs, leaving
a standard combined format log file.
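The effect of the split can be sketched in a few lines of shell and awk. This is an illustration of the behavior described here, not the split-logfile Perl script itself; the function name split_by_vhost and the sample log entries are assumptions.

```shell
#!/bin/sh
# Illustration only: mimics the behavior described for split-logfile.
# split_by_vhost (a hypothetical name) reads a %v-prefixed log on stdin and
# appends each entry, minus its first field, to <vhost>.log in directory $1.
split_by_vhost() {
    awk -v dir="$1" '
        NF > 0 {
            host = tolower($1)       # virtual host name written by %v
            sub(/^[^ ]+ +/, "")      # strip the leading virtual host field
            file = dir "/" host ".log"
            print >> file            # append, so earlier entries are kept
            close(file)              # do not hold hundreds of files open
        }'
}

# Demonstration with two made-up entries in a scratch directory.
demo=$(mktemp -d)
printf '%s\n' \
    'domain.com 192.0.2.1 - - [01/Jan/2000:00:00:01 +0000] "GET / HTTP/1.0" 200 512 "-" "-"' \
    'example.org 192.0.2.2 - - [01/Jan/2000:00:00:02 +0000] "GET /a HTTP/1.0" 404 0 "-" "-"' |
    split_by_vhost "$demo"
ls "$demo"
rm -rf "$demo"
```

Closing each file after writing trades some speed for a constant descriptor count, which matters when a single log covers hundreds of hosts.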
Once the proper log configuration was completed, we devised a
directory structure to use for the new logging methods. This is
the place where local admins can customize things to their own site
policies and practices. Our local practice is to use the GNU-style
layout for Apache; thus, Apache lives in /usr/local/apache,
and the main logs are in /usr/local/apache/var/log. Within
this directory, we created two additional directories -- old-logs
and cust-logs (short for customer logs). The old-logs directory
is used to store the (compressed) large, pre-split access_log. Local
retention policies and tape archiving policies will determine how
long to keep these logs. Our policy is to keep only the current
month's logs here, and older pre-split log files can be restored
from tape if necessary.
We use the cust-logs directory to store the actual virtual host
log files in their post-split state. This directory is a mount point
for a large partition that holds the individual logs. Split-logfile
creates logs based on the ServerName directive of the virtual host
with .log appended to the end. For example, if there were
a virtual host named domain.com, the log file produced would
be called domain.com.log. An idiosyncrasy of the split-logfile
script is that it writes its output (the split log files for each
virtual host) to the current working directory, so it must be run
from within the cust-logs directory. We store the split-logfile
script there as well.
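A minimal sketch of this layout follows, assuming a POSIX shell. The function names make_log_dirs and run_split are hypothetical; the directory names are the ones used in this article. Changing into cust-logs before invoking the script keeps the per-host logs in that directory.

```shell
#!/bin/sh
# Hypothetical helpers; directory names follow the article's layout.

# Create the archive and per-customer directories under a log root ($1).
make_log_dirs() {
    mkdir -p "$1/old-logs" "$1/cust-logs"
}

# Run split-logfile from inside cust-logs so the per-host logs land there.
# $1 = log root, $2 = absolute path of the pre-split access_log.
run_split() {
    ( cd "$1/cust-logs" && ./split-logfile < "$2" )
}
```

At our paths this would be called as make_log_dirs /usr/local/apache/var/log, after which the large partition is mounted on cust-logs and split-logfile is copied into it.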
While the split-logfile script is extremely useful, it doesn't
achieve its true potential without an external wrapper program that
handles all aspects of log file rotation and archival, which is
where the logcron script comes into play. Logcron (Listing 1) is
a fairly simple Korn shell script that was developed at my company.
One of the great things about logcron is that it is easy to modify
to handle additional tasks if necessary.
The logcron Script
The script is intended to be run through the cron facility on
a daily basis, or at an interval desired by the local administrator.
Please note that if the script is to be run more than once daily,
some minor modifications are required because the script writes
files using the year, month, and day for identification. A simple
change can be made to also utilize the time of day so that filenames
are unique. Change the line:
DATE=`/bin/date +%y%m%d`
to another format, which adds some other unique identifying characteristic,
such as the hour and minute. This new format (or any other format
the administrator prefers) can appear as:
DATE=`/bin/date +%y%m%d%H%M`
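To see the difference between the two stamps, the short sketch below prints an archive name built from each. The access_log.<stamp> naming pattern is an assumption for illustration, and plain date is used here in place of /bin/date so the example runs wherever date is on the PATH.

```shell
#!/bin/sh
# Command substitution (backticks) runs date and captures its output.
DAILY=`date +%y%m%d`        # two-digit year, month, day
TIMED=`date +%y%m%d%H%M`    # the same, plus hour and minute

# Assumed naming pattern: the stamp is appended to the log's name.
echo "access_log.$DAILY"
echo "access_log.$TIMED"
```

With the daily stamp, a second run on the same day would produce the same archive name as the first; the longer stamp stays unique as long as runs are at least a minute apart.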
The script is heavily commented and should be easy to follow. The
basic flow of logcron is:
1. The access_log and error_log files are copied to the archive
directory (old-logs) and renamed with the date appended to the file
to make the filename unique. Note that at this time, Apache is still
writing to these files.
2. New, blank logfiles are created in the standard location.
3. The httpd daemon is restarted with the USR1 signal. This signal
tells the daemon to gracefully close the old file descriptors and
begin writing to the new ones. When dealing with log rotation, SIGUSR1
is preferred over SIGHUP because it lets child processes finish the
requests they are serving instead of being killed immediately.
4. Split-logfile is run on the archived access_log. The split
logs are deposited in the cust-logs directory by the split-logfile
script.
5. The original archived logs are gzipped for more efficient storage.
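The five steps above can be sketched as a single shell function. This is not Listing 1; it is a simplified illustration under stated assumptions: the PID file location, the GRACE delay, and the function name rotate_and_split are hypothetical, and the sketch moves the live logs with mv rather than copying them, which assumes old-logs is on the same filesystem as the live logs.

```shell
#!/bin/sh
# Sketch of the five steps; this is not the author's Listing 1.
# Paths other than the log directory are assumptions for illustration.
LOGDIR=${LOGDIR:-/usr/local/apache/var/log}
PIDFILE=${PIDFILE:-$LOGDIR/httpd.pid}   # assumed location of Apache's PID file
GRACE=${GRACE:-60}                      # assumed wait for children to finish
DATE=$(date +%y%m%d)

rotate_and_split() {
    old=$LOGDIR/old-logs
    cust=$LOGDIR/cust-logs

    for f in access_log error_log; do
        # 1. Archive under a dated name. mv (not cp) is used in this sketch,
        #    so Apache keeps writing to the renamed file until signalled.
        mv "$LOGDIR/$f" "$old/$f.$DATE" || return 1
        # 2. Create a new, blank log in the standard location.
        : > "$LOGDIR/$f" || return 1
    done

    # 3. Graceful restart: Apache reopens its log files on SIGUSR1.
    kill -USR1 "$(cat "$PIDFILE")" || return 1
    sleep "$GRACE"

    # 4. Split the archived access_log from inside cust-logs.
    ( cd "$cust" && ./split-logfile < "$old/access_log.$DATE" ) || return 1

    # 5. Compress the archived originals.
    gzip "$old/access_log.$DATE" "$old/error_log.$DATE"
}
```

The pause after the signal gives children that are still serving requests time to finish writing to the old log before it is split and compressed; a suitable value depends on how long requests run at a given site.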
At our site, we run logcron once daily at 11:55 PM because it
can take a few minutes to copy the monolithic log file to the archive
directory and restart Apache. Thorough instructions for running
the script from cron are provided in the comments of the script.
The logcron script is written to allow unlimited local customization
and expansion, such as using the script to spawn log analysis programs
or bandwidth monitoring tools. If you do not perform reverse DNS
lookups during the course of normal logging (the HostnameLookups
directive is off in httpd.conf), the lookups can be performed through
logcron before the monolithic log file is passed to split-logfile.
In fact, this is a change that we made to our procedures later in
order to maximize performance. It is also possible to use logcron
to perform other specific log maintenance, such as removing old
logs on a monthly basis or to do general Apache maintenance on a
specific timeline.
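As one example of such maintenance, the sketch below removes compressed archives older than a given number of days. The function name purge_old_logs and the 31-day figure in the usage note are assumptions; a site's retention policy determines the real value.

```shell
#!/bin/sh
# Hypothetical helper: delete compressed archives older than a given age.
# $1 = archive directory, $2 = age in days.
purge_old_logs() {
    find "$1" -type f -name '*.gz' -mtime +"$2" -exec rm -f {} \;
}
```

Called as purge_old_logs /usr/local/apache/var/log/old-logs 31, it leaves uncompressed files and anything newer than the cutoff untouched, so archives should already be on tape before the cutoff is reached.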
Through the use of a Perl script included with Apache, a simple
custom-written Korn shell script, and a few simple modifications
to Apache's configuration directives, we have streamlined the
task of virtual host logging at our site. We have not only achieved
respite from the problem of running out of file descriptors (which
has the potential to bring down or cripple hundreds or thousands
of sites), but we have also turned logging into a centralized, organized
procedure. Rather than having logs scattered throughout the file
system, they are now all maintained on a dedicated disk device where
analysis and maintenance are simplified. The project helped us achieve
our goal to have a more robust and easily customized environment.
Jay Ribak has been a systems administrator working with various
flavors of UNIX (with an emphasis on Linux, Solaris, and FreeBSD)
for the past seven years. He is a founding partner and lead network
engineer of a small Web hosting company, Web Serve Pro. Jay can
be reached via email at: jay@webservepro.com.