Cover V09, I12
Article

dec2000.tar


So What's the Difference?

Randal L. Schwartz

A lot of common programming is dealing with things that change. Things do indeed change, and sometimes we'd like to know how they changed.

For example, if we have a list of items:

@one = qw(a b c d e f g);
then later, we look at it again, and there's a different set of items:

@two = qw(b c e h i j);
How can we tell what's new, what's old, and what's gone? We could certainly try to do it by brute force:

@one = qw(a b c d e f g);
@two = qw(b c e h i j);
foreach $one (@one) {
  if (grep $one eq $_, @two) {
    print "$one is in both old and new\n";
  } else {
    print "$one has been deleted\n";
  }
}
foreach $two (@two) {
  unless (grep $two eq $_, @one) {
     print "$two has been added\n";
  }
}
This in fact gives us an appropriate response:

a has been deleted
b is in both old and new
c is in both old and new
d has been deleted
e is in both old and new
f has been deleted
g has been deleted
h has been added
i has been added
j has been added
But this is incredibly inefficient. The computation time will rise in proportion to the product of sizes of both the lists. This happens because every element of one list is being compared to every element of the other list (twice, in fact). The grep operator is a loop over each item, so we effectively have nested loops, which should usually be a danger sign.

The perlfaq4 manpage approaches this subject, giving a solution of something like:

  @union = @intersection = @difference = ();
  %count = ();
  foreach $element (@one, @two) { $count{$element}++ }
  foreach $element (keys %count) {
      push @union, $element;
      push @{ $count{$element} > 1 ? \@intersection : \
             \@difference }, $element;
  }
with the caveat that we're assuming one item of each kind within each list. While that works for our input data as well, we'll run into trouble on more general data. However, with a slight modification, we can handle even duplicate items in each list:

@one = qw(a a a a b c d e f g);
@two = qw(b c e h i i i i j);
my %tracker = ();
$tracker{$_} .= 1 for @one;
$tracker{$_} .= 2 for @two;
for (sort keys %tracker) {
  if ($tracker{$_} !~ /1/) {
    print "$_ has been added\n";
  } elsif ($tracker{$_} !~ /2/) {
    print "$_ has been deleted\n";
  } else {
    print "$_ is in both old and new\n";
  }
}
Success. Correct output, and it's reasonably efficient. If you're doing a lot of these, check into the CPAN modules starting with Set::.

Then we come to the problem of telling the difference between two sequences, where the ordering matters. The very nice Algorithm::Diff in the CPAN computes a reasonably short difference-list, similar to the UNIX diff command, to tell us how to transform one list into another. There are a number of interfaces.

The most interesting one I found was traverse_sequences, which gives me all of the elements of the two lists in sequence, but marked in a way that I can tell to which of the two lists (or both) the item belongs.

Let's look at a simple example:

use Algorithm::Diff qw(traverse_sequences);
@one = qw(M N a b P Q c d e f V W g h);
@two = qw(a b R S c d T U e f g h X Y);
traverse_sequences(\@one, \@two, {
  MATCH => sub { show($one[$_[0]], $two[$_[1]]) },
  DISCARD_A => sub { show($one[$_[0]], "---") },
  DISCARD_B => sub { show("---", $two[$_[1]]) },
});
sub show {
  printf "%10s %10s\n", @_;
}
Here, we've given two token sequences in @one and @two. Using traverse_sequences, we'll print out common sequences (via the MATCH callback, removed material (via the DISCARD_A callback), and new material (via the DISCARD_B callback). Changed material shows up as a series of deletes followed by a series of inserts.

The callbacks are defined as references to anonymous subroutines, more commonly known as “coderefs”. The two parameters passed to each of the callbacks are the current indicies within the @one and @two arrays. As this isn't the actual value, I need to take the index and look it up in the appropriate array.

The result is something like:

  M        ---
  N        ---
  a          a
  b          b
  P        ---
  Q        ---
---          R
---          S
  c          c
  d          d
---          T
---          U
  e          e
  f          f
  V        ---
  W        ---
  g          g
  h          h
---          X
---          Y
Notice the common sequences. The printf operation lines up the columns nicely.

Well, this is a nice text-mode tabular output, but we can get a bit nicer if we know we're sending the result to HTML. Let's color-code all deletions in red, and insertions in green.

A first cut at the algorithm generates far too many font tags:

use Algorithm::Diff qw(traverse_sequences);
@one = qw(M N a b P Q c d e f V W g h);
@two = qw(a b R S c d T U e f g h X Y);
traverse_sequences(\@one, \@two, {
  MATCH => sub { colorshow("", $one[$_[0]]) },
  DISCARD_A => sub { colorshow("red", $one[$_[0]]) },
  DISCARD_B => sub { colorshow("green", $two[$_[1]]) },
});
sub colorshow {
  my $color = shift;
  my $string = shift;
  if (length $color) {
    print "<font color=$color>$string</font>\n";
  } else {
    print "$string\n";
  }
}
This generates a correct result, but excessive output:

<font color=red>M</font>
<font color=red>N</font>
a
b
<font color=red>P</font>
<font color=red>Q</font>
<font color=green>R</font>
<font color=green>S</font>
c
d
<font color=green>T</font>
<font color=green>U</font>
e
f
<font color=red>V</font>
<font color=red>W</font>
g
h
<font color=green>X</font>
<font color=green>Y</font>
What we need is some tracking of state information to figure out if we're already in red or green mode:

use Algorithm::Diff qw(traverse_sequences);
@one = qw(M N a b P Q c d e f V W g h);
@two = qw(a b R S c d T U e f g h X Y);
traverse_sequences(\@one, \@two, {
  MATCH => sub { colorshow("", $one[$_[0]]) },
  DISCARD_A => sub { colorshow("red", $one[$_[0]]) },
  DISCARD_B => sub { colorshow("green", $two[$_[1]]) },
});
colorshow(""); # reset back to
BEGIN {
  my $currentcolor = "";

  sub colorshow {
    my $color = shift;
    my $string = shift;
    if ($color ne $currentcolor) {
      print "</font>\n" if length $currentcolor;
      print "<font color=$color>\n" if length $color;
      $currentcolor = $color;
    }
    if (defined $string and length $string) {
      print “$string\n”;
    }
  }
}
Here, I'm tracking the state of the current HTML color in the $currentcolor static variable. As it changes, I send out the end-font or begin-font tags as needed. The only oddness now is that I need to make one final call to colorshow with the uncolored tag to close off any final begin-font tag. This call would be harmless if we were already outside a colored region.

That's much better, resulting in:

<font color=red>
M
N
</font>
a
b
<font color=red>
P
Q
</font>
<font color=green>
R
S
</font>
c
d
<font color=green>
T
U
</font>
e
f
<font color=red>
V
W
</font>
g
h
<font color=green>
X
Y
</font>
Although my Web-hacking friends might prefer to see that as:

<span style="background: red; color: black">
M
N
</span>
a
b
<span style="background: red; color: black">
P
Q
</span>
<span style="background: green; color: black">
R
S
</span>
c
d
<span style="background: green; color: black">
T
U
</span>
e
f
<span style="background: red; color: black">
V
W
</span>
g
h
<span style="background: green; color: black">
X
Y
</span>
That'd be a pretty easy small change, but I'll leave that to you. There's a little more whitespace in the output here than I like, but at least the job is getting done with minimal hassle.

So, now when someone asks you “what's the difference?”, you can show different ways of answering that question! Until next time, enjoy!

About the Author

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the “must-have” standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.