Parsing and Summarizing a Logfile

Randal L. Schwartz

I recently put www.stonehenge.com behind a caching reverse-proxy, and rather than switch technologies, I'm using another instance of a stripped-down Apache server to do the job. But what kind of job is it doing? How many of my hits are being cached and delivered by the lightweight front servers, instead of going all the way through to the heavy mod_perl_and_everything_else backend servers?

Luckily, I have included the caching information in the access log file, thanks to the CustomLog and LogFormat directives:

LogFormat "[virt=%v] %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \
   \"%{User-Agent}i\" \"%{X-Cache}o\"" combined
CustomLog var/log/access_log combined

I have added a virtual host entry (for tracking) to the front of the line, and the X-Cache header of the response to the end of the line. Of course, doing so means my access log is not in a standard format any more, so I can't use off-the-shelf tools for log analysis. That's okay, because I'm pretty good at writing my own data-reduction tools. A typical output line looks like this:

[virt=www.stonehenge.com] 192.168.42.69 - - [10/May/2002:01:51:50 \
  -0700] "GET /merlyn/UnixReview/ HTTP/1.0" 200 101324 "-" \
  "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "MISS \
  from www.stonehenge.com"

For my analysis, I wanted to see how many of those X-cache fields began with HIT or MISS, indicating that the mod_proxy module had gone all the way to the backend server, and either gotten a good cache-able hit, or had to regenerate it. I also wanted the data summarized on an hour-by-hour basis, in a CSV-style file so I could pull it in to my favorite spreadsheet to do graphs and formulas.

So, I had a few problems to solve. I had to locate the data. I had to read it line-by-line, parsing it into fields of interest. And then I had to figure out what hour of the day the hit represented, so I could collect it into the right bucket. And finally, I had to spit out a CSV file for all the data from the first to last entry.

Locating the data was easy:

my $DIR = "/web/stonehenge-proxy/var/log";
@ARGV = <$DIR/access_log*> unless @ARGV;
while (<>) { ...

My access logs are all named access_log-something; I use a nightly log-roller that leaves it in the same directory with a timestamp appended. By storing the names into the @ARGV array, a simple "diamond" loop will read them nicely.

Next, I parse each line:

my ($server, $stamp, $meth_url, $status, $bytes, $proxy) = /
  ^
  \[virt=(\S+)\] \s+ # virtual servername
  \S+ \s+ # source IP
  \S+ \s+ \S+ \s+ # basicauth
  \[(.*?)\] \s+ # request timestamp
  "(.*?)" \s+ # method and URL and optional protocol
  (\d+) \s+ # status
  (\S+) \s+ # bytes
  ".*?" \s+ # referer
  ".*?" \s+ # useragent
  "(.*?)" # proxy status
  \s* \z # end of string
  /x or warn("cannot parse $ARGV $.: $_"), next;

Wow. That's a lot. Compare this regex to the sample line I presented previously, and the correlation should become clear. I'm using an "extended format" regular expression so I can comment each piece. I'm extracting more fields than needed, mostly because I was playing around with the different combinations of types of hits to the proxy server. Doesn't hurt to have more than you need, sometimes.

Next, the date thing had to be sorted out. I wanted a key that identified the particular hour of the hit. I did this by delegating that job to a subroutine:

my $key = date_string_to_epoch_key($stamp);

My plan was to have a UNIX epoch integer as a response from this subroutine. That was a pain, as I'll show shortly. Then, finally, I increment this hit into one of three buckets:

if ($proxy =~ /^HIT/) {
  $sums{$key}{hit}++;
} elsif ($proxy =~ /^MISS/) {
  $sums{$key}{miss}++;
} else {
  $sums{$key}{other}++;
}

This creates a "hash of hashes" (actually, a hash containing hashrefs). The first level is the timestamp, while the second level is really a C-like struct with constant names. Missing first-level entries are created automatically as needed, thanks to the very cool auto-vivification. The "other" bucket is for entries that don't use mod_proxy, such as redirects or static images served directly by the proxy server.

To round out the loop at the bottom, I want to report progress and also keep the $. variable in sync. Simple enough, using the recipe from the manpages:

} continue {
  if (eof) {
    close ARGV;
    warn "done with $ARGV\n";
  }
};

There. After that loop, the file processing is done. Then, I generate the report. I'll step through all of the hourly timestamps, regardless of whether a particular hour has a hit or not. Of course, if my Web server has a zero hit hour, it's very likely that something is seriously wrong, so I want to show a row with zeroes rather than skipping that row. The quick and dirty way to get the minimum and maximum keys was to sort the keys:

my ($minkey, $maxkey) = (sort { $a <=> $b} keys %sums)[0, -1];

And then we add the CSV header for easy labeling and import:

print "Date,Miss,Hit,Other\n";

Next, we'll dump. Hmm. I have a UNIX epoch time value, but I want a human-readable format acceptable to my spreadsheet program. The handiest tool for that job is the heavy-hitting Date::Manip package, found in the CPAN, and its nice UnixDate formatting routine. Yeah, there are probably 18 other ways to do that, but Date::Manip usually does the job for me nicely.

use Date::Manip;

for (my $key = $minkey; $key <= $maxkey; $key += 3600) {
  my $h = $sums{$key};

  print join(",",UnixDate("epoch $key","%m/%d/%Y %H:%M:%S"),
     map { $h->{$_} || 0 } qw(miss hit other)), "\n";
}

Note that I use || 0 to ensure that a missing entry actually shows up as a zero entry.

And, finally, for that sticky part. I have to translate the timestamp into an hour epoch value. Again, Data::Manip is helpful here. I edit the string so that the minutes and seconds are zeroed out, then pass that string to ParseDateString to extract the value. However, if I did that on every hit, the time would be prohibitively expensive. So I'll cache the result like so:

BEGIN {
  my %cache;

  sub date_string_to_epoch_key {
    my $stamp = shift;
    $stamp =~ s/(\d\d):\d\d:\d\d /$1:00:00 /; # reduce to hour
    $cache{$stamp} ||= UnixDate(ParseDateString($stamp), "%s");
  }
}

If a given $stamp exists as a key in %cache hash, then the ||= operator stops, returning that value. However, if the key is missing, then the (expensive) expression on the right produces the value, which is both stored into the hash element and returned! Very nice idiom, although it fails if the cached value just happens to be zero, since the expensive expression will be recalculated each time.

And the end result is a file on standard output looking a lot like:

Date,Miss,Hit,Other
05/10/2002 01:00:00,30,2,26
05/10/2002 02:00:00,273,15,183
05/10/2002 03:00:00,224,14,108
05/10/2002 04:00:00,201,7,122
05/10/2002 05:00:00,558,28,474
05/10/2002 06:00:00,615,74,329
05/10/2002 07:00:00,819,61,352
05/10/2002 08:00:00,605,85,450

which my spreadsheet program pulls up just fine. Or, I could even import it into a database and perform more complex queries. But at least now I know how successful my proxy cache is at avoiding hits to the fat backend processes. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.