
Finding Old Things

Randal L. Schwartz

One of the great things about the Perl community is the many ways you can obtain Perl information, whether to solve typical tasks or to get past those sticky little problems. Besides the professional documentation, there are also the myriad man pages that come with the Perl distribution itself, and the CPAN modules. And with a little searching on the Net, you can quickly find very active support for Perl in the form of Usenet newsgroups, mailing lists, and a few Web communities.

The most active Perl Web community that I participate in is known as the Perl Monastery, at http://perlmonks.org. Each day, a few hundred active users and a few thousand visitors post and answer questions and chat in the "chatterbox". In my past role as a systems administrator, it seemed like I was always under pressure to solve problems in annoyingly brief amounts of time. The Monastery is therefore a welcome resource, because questions often get answered within minutes, so help is just a browser-reload or two away.

Recently, a relatively new Monk (as we call the participants) who goes by the name "Aquilo" asked for help with a script that:

recurses through a directory structure and checks if more than half of the files in that directory have been used in the past 180 days. The path of directories which are predominately unused is appended to a list of directories which will be used to archive them.

Aquilo gave some sample code that performed directory recursion in the traditional beginner way. I say traditional, because one of Perl's "rites of passage" seems to be to "write a directory recursion routine". Generally, these solutions are not as flexible, efficient, or portable as Perl's bundled File::Find module, but beginners generally aren't aware of it.

A typical hand-rolled directory recursion starts like so:

sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
and when that gets typed in and run, it runs forever, leaving the beginner scratching his head wondering why. Well, the basic structure is correct, except that the readdir call in the fourth line returns not only the files contained within the directory, but also the directories contained within it. That's good for the most part, but we also get the ever-present "dot" and "dotdot" entries, pointing at the current directory and the parent directory.

This means that while we are processing our first directory (say, fred), we'll also recurse to process that directory followed by slash-dot (fred/.), which processes the same directory again, which then recurses to process that new name followed by slash-dot (fred/./.), and so on, forever. Oops! Similarly, we'll also process the parent directory, which in turn contains this directory as an entry. Even the fastest supercomputer cannot process an infinite loop in less than infinite time.

So, the next move the beginner usually takes is to strip out the dot files, or maybe just dot and dot-dot. Something like this:

sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      next if $entry eq "." or $entry eq "..";
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
and this is a little better. For the most part, everything runs fine, but we still run into trouble when we hit a symbolic link (or "symlink"). If the symlink points to a directory in a parallel part of the tree, that's usually OK. But if it points to a directory deeper in the current tree, we'll process that portion twice: once as a result of following the symlink, and once when we actually get there by normal directory recursion.

And then there's the possibility of a symlink pointing to a directory above our current directory. It's just as bad as following dot-dot -- we'll process the new directory recursively, coming right back down to where we've already gone. Infinite loop-time again.

So, the next necessary refinement is often "ignore symbolic links":

sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      next if $entry eq "." or $entry eq "..";
      next if -l "$dir/$entry";
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
And there you have a perfectly fine directory recursion routine -- as long as you're running on UNIX, and not MacOS, VMS, OS/2, or Windows, because the step that builds "$dir/$entry" produces the wrong path syntax on those other systems. There are also some speed optimizations we haven't even begun to consider here, such as not recursing into directories that have no subdirectories. Luckily, that's all done for us in the File::Find module, so let's get back to solving the issue raised by Aquilo.
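One quick aside before we do: if you ever need the hand-rolled version to be portable, the standard File::Spec module (also bundled with Perl) knows how to join directory and file names with the right syntax for whatever platform you're on. Here's a minimal sketch of the final hand-rolled version using File::Spec->catfile; I've also read the directory entries up front and added the closedir call that the versions above quietly omit:

use File::Spec;

sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    my @entries = sort readdir DIR;
    closedir DIR;
    for my $entry (@entries) {
      next if $entry eq "." or $entry eq "..";
      my $path = File::Spec->catfile($dir, $entry); # portable "$dir/$entry"
      next if -l $path;
      do_this($path); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}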

First, we'll pull in the module:

use File::Find;
Next, let's set up a hash (actually, a nested hash):

my %ages;
We'll use this to keep track of how many items in each directory are older than 180 days, and how many are newer. Each key in %ages will be the full directory path name. The corresponding value will be a hashref pointing to a hash with two keys, old and new, whose values are running counts of the files of each category in that directory.
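For example, after a run, %ages might contain something like this (the directory names and counts here are made up for illustration):

my %ages = (
  "fred"   => { old => 12, new => 3 }, # mostly old files
  "fred/a" => { new => 5 },            # nothing old seen here
  "fred/b" => { old => 7, new => 9 },  # mostly new files
);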

Now the easy and fun part -- calling File::Find::find:

find \&wanted, "fred";
The find routine expects a reference to a subroutine (a "coderef") as its first parameter. It then descends into all of the directories listed as the remaining arguments -- in this case, just fred. The subroutine gets called for every entry at or below the listed directories, with a few variables set for us. Let's take a look at wanted now:

sub wanted {
  return unless -f $_;                        # skip anything that's not a plain file
  my $old_flag = -M $_ > 180 ? 'old' : 'new'; # -M is the age in days since last modification
  $ages{$File::Find::dir}{$old_flag}++;       # count it under its directory
}
A lot is going on in these few lines. First, the subroutine returns quickly if we're not looking at a file. The value of $_ is the basename of the current entry (like $entry in the hand-rolled version earlier), and the current working directory is the directory being examined, because find chdirs there for us. Next, $old_flag is set to either old or new, based on the file's modification age. Finally, the hash referenced by the value of $ages{$File::Find::dir} has one of its two entries incremented, based on this $old_flag. The value of $File::Find::dir is the full pathname of our current directory. So, if a file named dino in fred/a had been modified within the past 180 days, then $ages{"fred/a"}{new} would be incremented, showing that we have a new file.
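If you'd like to see exactly what find hands us at each step, a throwaway diagnostic callback (not part of Aquilo's solution, just an illustration) makes it visible:

use File::Find;

find(sub {
  print "dir: $File::Find::dir, entry: $_, full path: $File::Find::name\n";
}, "fred");

Running that over the fred tree prints those three values for every file and directory visited.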

After the directory recursion pass is complete, we merely have to walk the data to see what's old enough now:

for my $dir (sort keys %ages) {
  if ($ages{$dir}{old} > $ages{$dir}{new}) {
    print "$dir has more old than new\n";
  }
}
And there you have it! For all directories that have more old entries than new entries, we'll get a report.

On that last comparison, we'll get warnings if -w or use warnings is enabled, because some directories have new items but no old ones, or old items but no new ones, leaving one side of the comparison undefined. For a small program like this, I probably wouldn't bother enabling warnings, but if you're a stickler for that, then you might also want to clean up the values a bit before comparing:

for my $dir (sort keys %ages) {
  my $old = $ages{$dir}{old} || 0;
  my $new = $ages{$dir}{new} || 0;
  if ($old > $new and $old > 50) {
    print "$dir has more old than new\n";
  }
}
And notice, while I was cleaning up, I couldn't resist tinkering a bit, making it so that directories with 50 or fewer old items are not reported -- just one more thing that the cleanup made slightly easier.
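For reference, here's the entire program assembled in one place. It's just the fragments from above, with a shebang line and use strict added:

#!/usr/bin/perl
use strict;
use File::Find;

my %ages; # $ages{$dir}{old|new} = count of files older/newer than 180 days

sub wanted {
  return unless -f $_;                        # skip anything that's not a plain file
  my $old_flag = -M $_ > 180 ? 'old' : 'new'; # age in days since last modification
  $ages{$File::Find::dir}{$old_flag}++;       # count it under its directory
}

find \&wanted, "fred";

for my $dir (sort keys %ages) {
  my $old = $ages{$dir}{old} || 0;
  my $new = $ages{$dir}{new} || 0;
  if ($old > $new and $old > 50) {
    print "$dir has more old than new\n";
  }
}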

And there you have it -- a typical systems administration task hacked out in a few lines of code, and a reference to a great resource. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl on Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.