Cover V11, I11


Automatically Extending Your Data

Randal L. Schwartz

Perl is great at parsing data and bringing it into memory-based data structures for reformatting or analysis ("data reduction"). One of Perl's features that permits relatively easy creation of complex data structures is "auto-vivification" -- a mouthful to say, but it roughly means "data structures get expanded as necessary".

A frequent first reaction when I present auto-vivification in the courses I teach is "isn't that dangerous?" and "how can I turn that off?" and "doesn't that violate 'use strict'?". Well, the answers are "no" and "you don't" and "no". But, let me explain by first going back to something simpler.

Ever since the earliest versions of Perl, you've been able to say:

$count[3] = 14;
as the first statement of your program. What happens here is that Perl realizes that @count doesn't exist, and creates it. This new @count isn't long enough to hold four elements yet (elements 0 through 3), so Perl extends the array to include those elements, and puts the value 14 at $count[3].

Similarly, the code:

$seen{'Dino'} = 7;
has also always worked, first creating the hash %seen if it didn't exist, then adding a hash element with a key of Dino, and finally putting 7 as the corresponding value.

The value of this "automatic creation of variables" is that I don't need to first predetermine all the possible index values for a given hash or array, create the structure properly, and then run my program. I can simply run my program, and the data structures will expand as necessary to hold the values.
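A minimal sketch of both cases, runnable on its own. The `my` declaration is an addition so the snippet passes `use strict`; the one-liners above assume a program that has not declared the variables at all.

```perl
use strict;
use warnings;

# Declared but empty; under "use strict" the names must be declared first.
my (@count, %seen);

$count[3] = 14;       # array grows to hold indexes 0 through 3
$seen{'Dino'} = 7;    # key "Dino" is added on assignment

print "elements in \@count: ", scalar(@count), "\n";
print "element 0 defined? ", (defined $count[0] ? "yes" : "no"), "\n";
print "keys in \%seen: ", join(", ", keys %seen), "\n";
```

The array reports four elements even though only the last one was assigned; the first three are present but undef.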

Admittedly, language purists cringe at even this Perl capability, but let's just presume that they don't understand that the "P" in Perl stands for "Practical" (at least in one telling of the story).

When Perl references were added mid-way through Perl's life, Larry Wall extended the definition of this auto-extension of variables to include not-yet-defined references, and this is the action defined as "auto-vivification". This is best illustrated by an example:

$myreference = undef;
$myreference->[3] = 14;
Here, $myreference is being treated as an array reference in the second statement. However, at the moment, that's undef. But just as Perl creates variables where necessary, and extends arrays and hashes as needed, Perl will also plug in a pointer to an empty anonymous array here. It's as if these statements were rewritten as:

$myreference = undef;
$myreference = []; # inserted via auto-vivification
$myreference->[3] = 14;
Now the 14 is being inserted as the fourth element (element index 3) of the anonymous array, pointed to by the $myreference variable.

Again, this is really just a continuation of the prior behavior: "extend the data structures as necessary so that the show can go on". Formally, the rule is:

If a variable containing undef is being used in an assignment as if it were a reference to a data structure, a reference to an empty data structure of the appropriate type is placed into that variable before the operation continues.

The result is that we create data structures as needed. For example, this also works:

$myreference = undef;
$myreference->{'Dino'} = 7;
Note, however, that we'll end up with a hash reference in $myreference, not an array reference. This hash reference initially points at an anonymous empty hash, which is then nearly immediately extended to include an element with a key of Dino and a value of 7.
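One way to see the rule at work is to ask Perl what kind of reference ended up in each variable. This is only a sketch; the variable names `$array_ref` and `$hash_ref` are illustrative, and the built-in `ref` function is used to report the reference type.

```perl
use strict;
use warnings;

my $array_ref = undef;
$array_ref->[3] = 14;       # used as an array ref, so [] is supplied first

my $hash_ref = undef;
$hash_ref->{'Dino'} = 7;    # used as a hash ref, so {} is supplied first

print ref($array_ref), " with ", scalar(@$array_ref), " elements\n";
print ref($hash_ref), " with keys: ", join(", ", keys %$hash_ref), "\n";
```

Both variables started as undef; the way each one was de-referenced decided which kind of anonymous structure was created.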

The type of reference is determined by the type of the object we're trying to point at, not by the previous contents of the variable. In fact, the previous contents of the variable must be undef, or the rule given above doesn't apply. So, this sequence is guaranteed to fail:

$myreference = undef;
$myreference->[3] = 14;
$myreference->{'Dino'} = 7;  # fails
We're trying to use the now-present array reference in $myreference as if it were a hash reference. This can't work (ignoring the soon-to-be-removed pseudo-hash feature, anyway), and will throw a runtime exception.
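To watch the failure without killing the program, the offending statement can be wrapped in a block `eval`. This is a sketch, assuming a perl without pseudo-hashes (5.10 or later); the `$ok` and `$error` names are illustrative.

```perl
use strict;
use warnings;

my $myreference = undef;
$myreference->[3] = 14;    # auto-vivifies an array reference

# Trap the fatal error so the script can report it instead of dying.
my $ok = eval { $myreference->{'Dino'} = 7; 1 };
my $error = $@;

print "assignment ", ($ok ? "succeeded" : "failed"), "\n";
print "error: $error" unless $ok;
print "still holding an ", ref($myreference), " reference\n";
```

The variable is left untouched by the failed statement: it still holds the array reference created on the second line.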

The examples above deliberately put an undef into the variable, but the undef that is present in a newly created variable would have worked just as well:

my $newreference;
$newreference->[3] = 14;
And recall that a new element of an array or hash also has this same sort of undef:

my @pointers;
$pointers[42]->{'Dino'} = 7;
Here, $pointers[42] doesn't exist, so Perl first extends the @pointers array to include that element. But then the element is being used as if it were a hash ref, so Perl places an anonymous hash reference into $pointers[42], and continues the operation. If we consistently placed only hash references into this array, we'd have a dynamically allocated array of hashrefs.

Of course, you can drop that arrow, because it's between two "subscript-y kind of things" (technical terms), so it's more commonly written as $pointers[42]{'Dino'}. And even the quotes aren't necessary there, since the hash key is a simple identifier (letters, digits, and underscores), so we can reduce that further to $pointers[42]{Dino} safely.
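The three spellings can be checked side by side. This short sketch also reports what the array looks like after the single assignment.

```perl
use strict;
use warnings;

my @pointers;
$pointers[42]->{'Dino'} = 7;    # fully explicit form

# All three spellings address the same element.
print $pointers[42]->{'Dino'}, "\n";
print $pointers[42]{'Dino'}, "\n";
print $pointers[42]{Dino}, "\n";

print "array length: ", scalar(@pointers), "\n";
print "slot 42 holds a ", ref($pointers[42]), " reference\n";
print "slot 41 is ", (defined $pointers[41] ? "defined" : "undef"), "\n";
```

Only the slot that was actually used as a hash reference received one; the slots before it were created by the array extension and remain undef.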

An action might invoke multiple levels of auto-vivification. For example, let's look at the following code:

my $source = "red";
my $destination = "yellow";
my $length = 35;

$lengths{$source}{$destination} = $length;
In the last statement, the hash element $lengths{red} is used as a hash reference and de-referenced, and the element of that hash with a key of yellow is given the value 35. If these are the first few steps of the program, %lengths won't even exist, so it gets created first. Then, since $lengths{red} doesn't exist, Perl installs a reference to an empty hash there (via auto-vivification). Finally, the element with a key of yellow in that hash is given the value 35, and we're done. This is more commonly encountered in a loop:

while (<DATA>) {
  my ($source, $destination, $length) = split;
  $lengths{$source}{$destination} = $length;
}
# more code here later
__END__
red yellow 35
red green 19
purple blue 12
blue orange 18
Note that once the first line is processed, creating a hash reference for $lengths{red}, the second line doesn't create a new hash reference, because $lengths{red} is already defined. So the elements with keys of yellow and green are both in the same hash, referenced by the hash element of $lengths{red}.
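As one possibility for the "more code here later" part, the sketch below builds the same structure and then walks it. The records are held in a list (a hypothetical `@records`) instead of being read from DATA, purely so the snippet stays in one piece.

```perl
use strict;
use warnings;

# Same sample records as above, held in a list rather than read from DATA.
my @records = (
  "red yellow 35",
  "red green 19",
  "purple blue 12",
  "blue orange 18",
);

my %lengths;
for (@records) {
  my ($source, $destination, $length) = split;
  $lengths{$source}{$destination} = $length;
}

# Walk the two-level hash in sorted order for a stable report.
for my $source (sort keys %lengths) {
  for my $destination (sort keys %{ $lengths{$source} }) {
    print "$source -> $destination: $lengths{$source}{$destination}\n";
  }
}
```

Four records produce three top-level keys, because both red records land in the one inner hash that was created when the first of them was processed.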

A variant on this for tabulation purposes involves the automatic initialization to undef for a variable with respect to an operator like +=. For example, the following code sums a list of numbers:

while (<DATA>) {
  my ($number) = split;
  $sum += $number;
}
print "$sum\n";
__END__
3
5
19
The first time through the loop, $sum is uninitialized, and therefore guaranteed to be undef, but this happens to be the perfect base value for +=, which treats the undef like a 0 because addition is a numeric operation. We can apply this to a complex data reduction:
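A small sketch of that starting condition, with the three sample numbers supplied as a list. The `$SIG{__WARN__}` handler and the `@warnings` array are additions for illustration; they collect any warnings so the script can report how many were raised while `use warnings` is in effect.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $sum;                       # undef to begin with
$sum += $_ for (3, 5, 19);     # the first += treats undef as 0

print "sum: $sum\n";
print "warnings raised: ", scalar(@warnings), "\n";
```

Perl exempts `+=` on an undefined variable from its "uninitialized value" warning, which is why accumulating into a fresh variable stays quiet.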

while (<DATA>) {
  my ($source, $destination, $hits) = split;
  $total_hits{$source}{$destination} += $hits;
}
# more code here later
__END__
red yellow 35
red green 19
red yellow 12
blue red 18
blue red 8
Just like the previous summing example, we're accumulating totals, but now they're organized by each pair of source and destination. Looking at the first invocation:

$total_hits{red}{yellow} += 35;
Since %total_hits is empty at this point, Perl first extends the hash to include a hashref at $total_hits{red}. This hashref initially points to an empty hash, but then gets extended to include an element at the key of yellow. However, since the value at this key is being used in a +=, the initial undef value is treated as 0, and then 35 gets added, resulting in 35. This 35 is then stored in place of the initial undef, and we're done. When the third step is executed:

$total_hits{red}{yellow} += 12;
the 12 is added to the existing value of 35, yielding 47, and that becomes the updated value.
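Putting the pieces together, here is a sketch of the whole tabulation followed by a simple report. As an assumption for the sake of a single self-contained snippet, the sample records live in a hypothetical `@records` list rather than in a DATA section.

```perl
use strict;
use warnings;

# Same sample records as above, held in a list rather than read from DATA.
my @records = (
  "red yellow 35",
  "red green 19",
  "red yellow 12",
  "blue red 18",
  "blue red 8",
);

my %total_hits;
for (@records) {
  my ($source, $destination, $hits) = split;
  # Vivifies the inner hash on first use, then adds to undef-as-0.
  $total_hits{$source}{$destination} += $hits;
}

# Report the totals in sorted order.
for my $source (sort keys %total_hits) {
  for my $destination (sort keys %{ $total_hits{$source} }) {
    print "$source -> $destination: $total_hits{$source}{$destination}\n";
  }
}
```

With this data the report shows 26 for blue to red, 19 for red to green, and 47 for red to yellow.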

The important point here is that you write what you want it to do, and it just works. That's the nice thing about Perl. It very often just Does The Right Thing. So, be mystified by auto-vivification no more: learn to embrace it, use it, and like it! Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.