Cover V09, Ixx
Article

feb2000.tar


Deep Copying, not Deep Secrets

Randal L. Schwartz

One of the modules I encountered in the CPAN had a problem with creating multiple objects. Each object was created fine, but on further use simultaneously in the program, some of the data from one object mysteriously showed up in the second one, even if the first one was freed!

Upon inspection, I found that the object was being initialized from a shallow copy of a template, and I told the author that he needed to deep copy the template instead. He was puzzled by the suggestion, and if you aren't familiar with these two terms, You may be a little confused now as well.

What is deep copying and why do we need it? I'll start with a simple example, and work back to the problem I presented.

For example, grab all the password information into a simple hash, with the key being the username, and the value being an arrayref pointing to the nine element value returned by getpwent(). On a first cut, we quickly hack out something like this:

while (@x = getpwent()) {
  $info{$x[0]} = \@x;
}
for (sort keys %info) {
  print "$_ => @{$info{$_}}\n"
}

What? Where did all the data go? We stored a reference to the data into the hash value. Well, maybe this will make it more clear:

while (@x = getpwent()) {
  $info{$x[0]} = \@x;
  print "$info{$x[0]}\n";
}

On my machine, this printed ARRAY(0x80ce7fc) dozens of times -- once for every user on the system. So, what does that mean? It means that we are reusing the same array repeatedly, and therefore we don't have dozens of arrayrefs pointing to distinct arrays; we have a single array with many pointers to the same data. So on the last pass through the gathering loop, @x is emptied, and therefore all the array references are pointing to the identical empty data.

This is because we made a shallow copy of the reference to @x into the hash value, which is a copy of only the top-level pointer, but not the contents. What we really needed was not only a copy of the reference, but a copy of to what the reference pointed. That's simple enough here:

while (@x = getpwent()) {
  $info{$x[0]} = [@x];
}
for (sort keys %info) {
  print "$_ => $info{$_} => @{$info{$_}}\n"
}

Notice, we've got a distinct arrayref for each hash element, pointing to an independent copy of the nine elements of the array originally contained in @x. This worked because we created a new anonymous arrayref with the expression [@x], which also gives this anonymous array an initial value made of copies of the elements of @x.

So that's a basic deep copy -- copying not only the top level pointer, but also all the things within the data structure to maintain complete independence.

Actually, there was one other way to ensure unique sub-elements in this example, and I'll show it for completeness lest my Perl hacking friends get irritated. You don't need to copy anything if you just generate the data in a distinct array in the first place:

while (my @x = getpwent()) {
  $info{$x[0]} = \@x;
}
for (sort keys %info) {
  print "$_ => $info{$_} => @{$info{$_}}\n"
}

Here, each pass through the loop starts with a brand-new completely distinct lexical @x rather than reusing the old existing variable. So, when a reference is taken to it, and it falls out of scope at the bottom of the loop -- the variable automatically remains behind as an anonmous variable.

But let's get back to deep copying. Here's another example. Suppose Fred and Barney are sharing a house:

$place = {
  Street => "123 Shady Lane",
  City => "Granite Junction",
};
$fred = {
  First_name => "Fred",
  Last_name => "Flintstone",
  Address => $place,
};
$barney = {
  First_name => "Barney",
  Last_name => "Rubble",
  Address => $place,
};

Note that $fred->{Address}{City} is “Granite Junction”, just as we might expect it, as is $barney->{Address}{City}. But we've done a shallow copy from $place into both of the Address element values. This means that there's not two copies of the data, but just one. We can see this when we change one of the values. Let's let Fred move to his own place:

$fred->{Address}{Street} = "456 Palm Drive";
$fred->{Address}{City} = "Bedrock";

Looks safe enough. But what happened to Barney? He moved along with Fred!

print "@{$barney->{Address}}{qw(Street City)}\n";

This prints Fred's new address! Why did that happen? Once again, the assignment of $place as the address in both cases made a shallow copy: both data structures shared a common pointer to the common street and city data. Again, a deep copy would have helped:

$place = {
  Street => "123 Shady Lane",
  City => "Granite Junction",
};
$fred = {
  First_name => "Fred",
  Last_name => "Flintstone",
  Address => {%$place},
};
$barney = {
  First_name => "Barney",
  Last_name => "Rubble",
  Address => {%$place},
};
$fred->{Address}{Street} = "456 Palm Drive";
$fred->{Address}{City} = "Bedrock";
print "@{$barney->{Address}}{qw(Street City)}\n";

Now each Address field is a completely disconnected copy, so when we update one, the other stays pure. This works because just like the [@x] construct, we are creating a new independent anonymous hash and taking a reference to it.

But what if $place was itself a deeper structure? That is, suppose the street address was made up of a number and a name:

$place = {
  Street => {
    Number => 123,
    Name => "Shady Lane",
  },
  City => "Granite Junction",
};
$fred = {
  First_name => "Fred",
  Last_name => "Flintstone",
  Address => {%$place},
};
$barney = {
  First_name => "Barney",
  Last_name => "Rubble",
  Address => {%$place},
};

We've now done something that's not quite a deep copy, but also not a shallow copy. Certainly, the hash at $fred->{Address} is different from $barney->{Address}. But they both contain a value that is identical to the $place->{Street} hashref! So if we move Fred just down the street:

$fred->{Address}{Street}{Number} = 456;

then Barney moves along with him again! Now, we could fix this problem by applying the logic for copying the address one more time to the street structure:

$fred = {
  First_name => "Fred",
  Last_name => "Flintstone",
  Address => {
    Street => {%{$place->{Street}}},
    City => $place->{City},
  },
};

As you can see, it's getting more and more convoluted. And what if we change City to be another structure, or added another level to Street. Bleh.

Fortunately, we can write a simple general-purpose deep copier with a recursive subroutine. Here's a simple little deep copy routine:

sub deep_copy {
  my $this = shift;
  if (not ref $this) {
    $this;
  } elsif (ref $this eq "ARRAY") {
    [map deep_copy($_), @$this];
  } elsif (ref $this eq "HASH") {
    +{map { $_ => deep_copy($this->{$_}) } keys %$this};
  } else { die "what type is $_?" }
}

This subroutine expects a single item -- the top of a tree of hashrefs and listrefs and scalars. If the item is a scalar, it is simply returned, since a shallow copy of a scalar is also a deep copy. If it's an arrayref, we create a new anonymous array from the data. However, each element of this array could itself be a data structure, so we need a deep copy of it. The solution is straightforward -- simply call deep_copy on each item. Similarly, a new hashref is constructed by copying each element, including a deep copy of its value. (The hash key is always a simple scalar, so it needs no copy, although that would have been easy enough to add.) To see it work, let's give some data:

$place = {
  Street => {
    Number => 123,
    Name => [qw(Shady Lane)],
  },
  City => "Granite Junction",
  Zip => [97007, 4456],
};
$place2 = $place;
$place3 = {%$place};
$place4 = deep_copy($place);

Hmm. How do we see what we've done, and what's being shared? Let's add a call to the standard library module, Data::Dumper:

use Data::Dumper;
$Data::Dumper::Indent = 1;
print Data::Dumper->Dump(
  [$place, $place2, $place3, $place4],
  [qw(place place2 place3 place4)]
);

And that generates on my system:

$place = {
  'City' => 'Granite Junction',
  'Zip' => [
    97007,
    4456
  ],
  'Street' => {
    'Name' => [
      'Shady',
      'Lane'
    ],
    'Number' => 123
  }
};
$place2 = $place;
$place3 = {
  'City' => 'Granite Junction',
  'Zip' => $place->{'Zip'},
  'Street' => $place->{'Street'}
};
$place4 = {
  'City' => 'Granite Junction',
  'Zip' => [
    97007,
    4456
  ],
  'Street' => {
    'Number' => 123,
    'Name' => [
      'Shady',
      'Lane'
    ]
  }
};

Hey, look at that. Data::Dumper let me know that $place2 is a shallow copy of $place, while $place3 is an intermediately copied value. Notice the elements of $place inside $place3. And since $place4 contains no previously seen references, we know it's a completely independent deep copy. Success! (The ordering of the hash elements is inconsistent, but that's immaterial and undetectable in normal use.)

Now, this simple deep_copy routine will break if there are recursive data pointers (references that point to already seen data higher in the tree). For that, you might look at the dclone method of the Storable module, found in the CPAN.

So, when you use = with a structure, be sure you know what you are doing. You may need a deep copy instead of that shallow copy. For further information, check out your online documentation with perldoc perldsc and perldoc perllol and even perldoc perlref and perldoc perlreftut for the basics. Until next time, enjoy!

About the Author

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the “must-have” standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. His offbeat humor and technical mastery have reached legendary proportions worldwide (but he probably started some of those legends himself). Randal's desire to give back to the Perl community inspired him to help create and provide initial funding for The Perl Institute (perl.org). He is also a founding board member of the Perl Mongers (pm.org), the worldwide association of “Perl users groups”. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.