Parsing and Summarizing a Logfile
Randal L. Schwartz
I recently put www.stonehenge.com behind a caching reverse-proxy,
and rather than switch technologies, I'm using another instance
of a stripped-down Apache server to do the job. But what kind of
job is it doing? How many of my hits are being cached and delivered
by the lightweight front servers, instead of going all the way through
to the heavy mod_perl_and_everything_else backend servers?
Luckily, I have included the caching information in the access
log file, thanks to the CustomLog and LogFormat directives:
LogFormat "[virt=%v] %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \
\"%{User-Agent}i\" \"%{X-Cache}o\"" combined
CustomLog var/log/access_log combined
I have added a virtual host entry (for tracking) to the front of the
line, and the X-Cache header of the response to the end of
the line. Of course, doing so means my access log is not in a standard
format any more, so I can't use off-the-shelf tools for log analysis.
That's okay, because I'm pretty good at writing my own data-reduction
tools. A typical output line looks like this:
[virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 \
-0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" \
"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS \
from www.stonehenge.com"
For my analysis, I wanted to see how many of those X-Cache
fields began with HIT or MISS, indicating that the request had been
handled by the mod_proxy module, and either satisfied from the cache
(HIT) or passed all the way to the backend server to be regenerated
(MISS). I also wanted the data summarized on an hour-by-hour basis,
in a CSV-style file, so I could pull it into my favorite spreadsheet
to do graphs and formulas.
So, I had a few problems to solve. I had to locate the data. I
had to read it line-by-line, parsing it into fields of interest.
And then I had to figure out what hour of the day the hit represented,
so I could collect it into the right bucket. And finally, I had
to spit out a CSV file for all the data from the first to last entry.
Locating the data was easy:
my $DIR = "/web/stonehenge-proxy/var/log";
@ARGV = <$DIR/access_log*> unless @ARGV;
while (<>) { ...
My access logs are all named access_log-something; I use a
nightly log-roller that leaves it in the same directory with a timestamp
appended. By storing the names into the @ARGV array, a simple
"diamond" loop will read them nicely.
Next, I parse each line:
my ($server, $stamp, $meth_url, $status, $bytes, $proxy) = /
    ^
    \[virt=(\S+)\] \s+    # virtual server name
    \S+ \s+               # source IP
    \S+ \s+ \S+ \s+       # identd and basicauth user
    \[(.*?)\] \s+         # request timestamp
    "(.*?)" \s+           # method and URL and optional protocol
    (\d+) \s+             # status
    (\S+) \s+             # bytes
    ".*?" \s+             # referer
    ".*?" \s+             # useragent
    "(.*?)"               # proxy status
    \s* \z                # end of string
/x or warn("cannot parse $ARGV $.: $_"), next;
Wow. That's a lot. Compare this regex to the sample line I presented
previously, and the correlation should become clear. I'm using
an "extended format" regular expression so I can comment
each piece. I'm extracting more fields than needed, mostly because
I was playing around with the different combinations of types of hits
to the proxy server. Doesn't hurt to have more than you need,
sometimes.
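As a quick sanity check while developing a pattern like this, it can
help to feed a known line through it and print the captures. This
scaffolding isn't part of the finished script; it just reuses the
sample line and regex from above:

use strict;
use warnings;

my $line = q{[virt=www.stonehenge.com] 192.168.42.69 - - }
         . q{[10/May/2002:01:51:50 -0700] "GET /merlyn/UnixReview/ HTTP/1.0" }
         . q{200 101324 "-" }
         . q{"Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" }
         . q{"MISS from www.stonehenge.com"};

if (my ($server, $stamp, $meth_url, $status, $bytes, $proxy) = $line =~ /
      ^
      \[virt=(\S+)\] \s+    # virtual server name
      \S+ \s+               # source IP
      \S+ \s+ \S+ \s+       # identd and basicauth user
      \[(.*?)\] \s+         # request timestamp
      "(.*?)" \s+           # method and URL and optional protocol
      (\d+) \s+             # status
      (\S+) \s+             # bytes
      ".*?" \s+             # referer
      ".*?" \s+             # useragent
      "(.*?)"               # proxy status
      \s* \z
    /x) {
    print "server=$server stamp=$stamp status=$status proxy=$proxy\n";
} else {
    print "no match\n";
}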
Next, the date thing had to be sorted out. I wanted a key that
identified the particular hour of the hit. I did this by delegating
that job to a subroutine:
my $key = date_string_to_epoch_key($stamp);
My plan was to have a UNIX epoch integer as a response from this subroutine.
That was a pain, as I'll show shortly. Then, finally, I increment
this hit into one of three buckets:
if ($proxy =~ /^HIT/) {
    $sums{$key}{hit}++;
} elsif ($proxy =~ /^MISS/) {
    $sums{$key}{miss}++;
} else {
    $sums{$key}{other}++;
}
This creates a "hash of hashes" (actually, a hash containing
hashrefs). The first level is the timestamp, while the second level
is really a C-like struct with constant names. Missing first-level
entries are created automatically as needed, thanks to the very cool
auto-vivification. The "other" bucket is for entries that
don't use mod_proxy, such as redirects or static images
served directly by the proxy server.
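Here's a tiny standalone illustration of that autovivification (the
epoch key is made up): the first increment conjures both the outer
hash entry and the inner hashref on demand.

use strict;
use warnings;
use Data::Dumper;

my %sums;
$sums{1020996000}{hit}++;    # hypothetical hour key; neither level existed yet
$sums{1020996000}{miss}++;

print Dumper(\%sums);        # prints a structure like
                             # { 1020996000 => { hit => 1, miss => 1 } }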
To round out the loop at the bottom, I want to report progress
and also keep the $. variable in sync. Simple enough, using
the recipe from the manpages:
} continue {
    if (eof) {
        close ARGV;
        warn "done with $ARGV\n";
    }
}
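Why bother with the close? Without it, $. keeps counting across files,
so the "cannot parse" warning above would report a misleading line
number for the second and later logs. A tiny demonstration, with
made-up file names:

@ARGV = qw(a.log b.log);          # hypothetical files
while (<>) {
    print "$ARGV line $.: $_";
} continue {
    close ARGV if eof;            # resets $. at each file boundary
}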
There. After that loop, the file processing is done. Then, I generate
the report. I'll step through all of the hourly timestamps, regardless
of whether a particular hour has a hit or not. Of course, if my Web
server has a zero hit hour, it's very likely that something is
seriously wrong, so I want to show a row with zeroes rather than skipping
that row. The quick and dirty way to get the minimum and maximum keys
was to sort the keys:
my ($minkey, $maxkey) = (sort { $a <=> $b } keys %sums)[0, -1];
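If sorting the whole key list just for its endpoints feels wasteful,
List::Util (on the CPAN, and bundled with recent perls) offers an
equivalent; a sketch of that alternative:

use List::Util qw(min max);

my $minkey = min keys %sums;
my $maxkey = max keys %sums;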
And then we add the CSV header for easy labeling and import:
print "Date,Miss,Hit,Other\n";
Next, we'll dump. Hmm. I have a UNIX epoch time value, but I
want a human-readable format acceptable to my spreadsheet program.
The handiest tool for that job is the heavy-hitting Date::Manip
package, found in the CPAN, and its nice UnixDate formatting
routine. Yeah, there are probably 18 other ways to do that, but Date::Manip
usually does the job for me nicely.
use Date::Manip;
for (my $key = $minkey; $key <= $maxkey; $key += 3600) {
    my $h = $sums{$key};
    print join(",", UnixDate("epoch $key", "%m/%d/%Y %H:%M:%S"),
               map { $h->{$_} || 0 } qw(miss hit other)), "\n";
}
Note that I use || 0 to ensure that a missing entry actually
shows up as a zero entry.
And, finally, for that sticky part. I have to translate the timestamp
into an hour epoch value. Again, Date::Manip is helpful here.
I edit the string so that the minutes and seconds are zeroed out,
then pass that string to ParseDateString to extract the value.
However, if I did that on every hit, the time would be prohibitively
expensive. So I'll cache the result like so:
BEGIN {
    my %cache;

    sub date_string_to_epoch_key {
        my $stamp = shift;
        $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /;   # reduce to the hour
        $cache{$stamp} ||= UnixDate(ParseDateString($stamp), "%s");
    }
}
If a given $stamp already exists as a key in the %cache hash, then
the ||= operator stops, returning that value. However, if the
key is missing, then the (expensive) expression on the right produces
the value, which is both stored into the hash element and returned!
Very nice idiom, although it fails if the cached value just happens
to be zero, since the expensive expression will be recalculated each
time.
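If a legitimately false value (say, a literal 0) could ever end up in
the cache, an exists() test avoids the repeated recomputation. Here's
a sketch of that variant; it would replace the subroutine inside the
same BEGIN block:

sub date_string_to_epoch_key {
    my $stamp = shift;
    $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /;   # reduce to the hour
    $cache{$stamp} = UnixDate(ParseDateString($stamp), "%s")
        unless exists $cache{$stamp};
    return $cache{$stamp};
}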
And the end result is a file on standard output looking a lot
like:
Date,Miss,Hit,Other
05/10/2002 01:00:00,30,2,26
05/10/2002 02:00:00,273,15,183
05/10/2002 03:00:00,224,14,108
05/10/2002 04:00:00,201,7,122
05/10/2002 05:00:00,558,28,474
05/10/2002 06:00:00,615,74,329
05/10/2002 07:00:00,819,61,352
05/10/2002 08:00:00,605,85,450
which my spreadsheet program pulls up just fine. Or, I could even
import it into a database and perform more complex queries. But at
least now I know how successful my proxy cache is at avoiding hits
to the fat backend processes. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming, as well as writing regular columns for WebTechniques
and Unix Review magazines. He's also a frequent contributor
to the Perl newsgroups, and has moderated comp.lang.perl.announce
since its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.