Finding Old Things
Randal L. Schwartz
One of the great things about the Perl community is the many
ways that Perl information can be obtained to solve typical
tasks or to get past those sticky little problems. Besides the professional
documentation, there are the myriad man pages that come
with the Perl distribution itself, and the CPAN modules. And with
a little searching on the Net, you can quickly find very active
support for Perl in the way of Usenet newsgroups, mailing lists,
and a few Web communities.
The most active Perl Web community that I participate in is known
as the Perl Monastery, at http://perlmonks.org. Each day,
a few hundred active users and a few thousand visitors post and
answer questions and chat in the "chatterbox". In my past
role as a systems administrator, it seemed like I was always under
pressure to solve problems in annoyingly brief amounts of time.
Therefore, the Monastery is a welcome resource, because questions
often get answered within minutes, so help is just a browser-reload
or two away.
Recently, a relatively new Monk (as we call the participants)
who goes by the name "Aquilo" asked for help with a script
that:
recurses through a directory structure and checks if more than
half of the files in that directory have been used in the past 180
days. The path of directories which are predominately unused is
appended to a list of directories which will be used to archive
them.
Aquilo gave some sample code that performed directory recursion
in the traditional beginner way. I say traditional, because one
of Perl's "rites of passage" seems to be to "write
a directory recursion routine". Generally, these solutions
are not as flexible, efficient, or portable as Perl's built-in
File::Find module, but beginners usually aren't aware
of that module.
A typical hand-rolled directory recursion starts like so:
sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
and when that gets typed in, and it runs forever, the beginner will
start to scratch his head wondering why. Well, the basic structure
is correct, except that the readdir call in the fourth line
returns not only the files contained within the directory, but also
the directories contained within the directory. That's good for
the most part, but we also get the always-present "dot"
and "dotdot" entries pointing at the current directory and
the parent directory.
This means that while we are processing our first directory (say
fred), we'll also recurse to process that directory
followed by slash-dot (fred/.), which processes the same
directory again, which then recurses to process that new name followed
by slash-dot (fred/./.), forever. Oops! Similarly, we'll
also process the parent directory, which then contains this directory
as an entry. Even the fastest supercomputer cannot process an infinite
loop in less than infinite time.
So, the next move the beginner usually takes is to strip out the
dot files, or maybe just dot and dot-dot. Something like this:
sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      next if $entry eq "." or $entry eq "..";
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
and this is a little better. For the most part, everything runs fine,
but we still run into trouble when we hit a symbolic link (or "symlink").
If the symlink points to a directory in a parallel part of the tree,
that's usually OK. But if it points to a directory deeper in
the current tree, we'll process that portion twice: once as a
result of following the symlink, and once when we actually get there
by normal directory recursion.
And then there's the possibility of a symlink pointing to
a directory above our current directory. It's just as
bad as following dot-dot -- we'll process the new directory
recursively, coming right back down to where we've already
gone. Infinite loop-time again.
So, the next necessary refinement is often "ignore symbolic
links":
sub do_this {
  my ($dir) = shift;
  if (opendir DIR, $dir) { # it's a directory
    for my $entry (sort readdir DIR) {
      next if $entry eq "." or $entry eq "..";
      next if -l "$dir/$entry";
      do_this("$dir/$entry"); # recurse
    }
  } else { # it's a file
    # operate on $dir (not shown)
  }
}
do_this("fred"); # our top-level directory
And there you have a perfectly fine directory recursion routine. As
long as you're running on UNIX, that is, and not MacOS, VMS, OS/2, or
Windows, because the step that creates $dir/$entry builds the path
the wrong way for those other systems. And there are speed
optimizations we haven't even begun to consider here, such as
skipping the recursion for directories that have no subdirectories.
Luckily, that's all done for us in the File::Find module, so let's get
back to solving the issue raised by Aquilo.
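If you ever do need to build such paths by hand portably, the core
File::Spec module will do the joining for you. Here's a minimal
sketch, with a made-up directory and file name just to show the call:

use File::Spec;
# join a directory name and a file name using the local system's conventions
my $path = File::Spec->catfile("fred", "dino");
print "$path\n"; # "fred/dino" on UNIX; other systems get their own separators

File::Find spares us from worrying about any of that.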
First, we'll pull in the module:
use File::Find;
Next, let's set up a hash (actually, a nested hash):
my %ages;
We'll use this to keep track of how many items in each directory
are newer than 180 days, and how many are older. We'll make each
key in %ages be the full directory path name. The corresponding
value will be a hashref pointing to a hash with two keys in it: old
and new. The values of those two entries will be a running
count of the number of files of that category in that directory.
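Just to picture the shape of that structure, here's what %ages might
end up holding after a scan; the directory names and counts below are
made up, purely for illustration:

%ages = (
  "fred/a"   => { old => 12, new => 3 }, # 12 files older than 180 days, 3 newer
  "fred/a/b" => { new => 7 },            # every file here was touched recently
);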
Now the easy and fun part -- calling File::Find::find:
find \&wanted, "fred";
The find routine expects a reference to a subroutine ("coderef")
as its first parameter. It then descends into all of the directories
listed as the remaining arguments, in this case, just fred.
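If we had more than one tree to scan, we could simply list them all;
a hypothetical call with a couple of extra made-up directories would
look like this:

find \&wanted, "fred", "barney", "/old/projects";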
The subroutine gets called for every entry below the listed directories,
with a few parameters set. Let's take a look at wanted
now:
sub wanted {
  return unless -f $_;
  my $old_flag = -M $_ > 180 ? 'old' : 'new';
  $ages{$File::Find::dir}{$old_flag}++;
}
A lot is going on in these few lines. First, the subroutine returns
quickly if we're not looking at a file. The value of $_
is the basename of the current entry (like $entry in the hand-rolled
version earlier), and our current directory is the directory being
examined. Next, $old_flag is set to either old or new,
based on looking at the file modification time. Finally, the hash
referenced by the value of $ages{$File::Find::dir} has one
of its two entries incremented, based on this $old_flag. The
value of $File::Find::dir is the full pathname to our current
directory. So, if a file named dino in fred/a were less than 180 days
old, then $ages{"fred/a"}{new} would be incremented,
showing that we have a new file.
After the directory recursion pass is complete, we merely have
to walk the data to see what's old enough now:
for my $dir (sort keys %ages) {
  if ($ages{$dir}{old} > $ages{$dir}{new}) {
    print "$dir has more old than new\n";
  }
}
And there you have it! For all directories that have more old entries
than new entries, we'll get a report.
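With the made-up counts from the sketch earlier, that report would be
just one line:

fred/a has more old than new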
On that last comparison, we'll get warnings if -w
or use warnings is enabled, because some directories have
new items but not old, or old items but not new. For a small program
like this, I probably wouldn't bother enabling warnings, but
if you're a stickler for that, then you might also want to
clean it up a bit before comparing:
for my $dir (sort keys %ages) {
  my $old = $ages{$dir}{old} || 0;
  my $new = $ages{$dir}{new} || 0;
  if ($old > $new and $old > 50) {
    print "$dir has more old than new\n";
  }
}
And notice, while I was cleaning up, I couldn't resist tinkering
a bit, making it so that directories with 50 or fewer old items
are not reported -- just one more tweak that the cleanup made
slightly easier.
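Putting all the pieces together, the whole program fits on one
screen. Here's a sketch of the finished script, with the same
top-level directory and the 50-item threshold from above baked in;
adjust both to taste:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my %ages; # $ages{$directory}{'old' or 'new'} = running count of files

sub wanted {
  return unless -f $_;                        # only plain files count
  my $old_flag = -M $_ > 180 ? 'old' : 'new'; # older or newer than 180 days?
  $ages{$File::Find::dir}{$old_flag}++;
}

find \&wanted, "fred";                        # our top-level directory

for my $dir (sort keys %ages) {
  my $old = $ages{$dir}{old} || 0;
  my $new = $ages{$dir}{new} || 0;
  if ($old > $new and $old > 50) {
    print "$dir has more old than new\n";
  }
}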
And there you have it -- a typical systems administration
task hacked out in a few lines of code, and a reference to a great
resource. Until next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming, as well as writing regular columns for WebTechniques
and Unix Review magazines. He's also a frequent contributor
to the Perl newsgroups, and has moderated comp.lang.perl.announce
since its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.