Automatically Extending Your Data
Randal L. Schwartz
Perl is great at parsing data and bringing it into memory-based
data structures for reformatting or analysis ("data reduction").
One of Perl's features that permits relatively easy creation
of complex data structures is "auto-vivification" --
a mouthful to say, but it roughly means "data structures get
expanded as necessary".
A frequent first reaction when I present auto-vivification in
the courses I teach is "isn't that dangerous?" and
"how can I turn that off?" and "doesn't that
violate 'use strict'?". Well, the answers are "no"
and "you don't" and "no". But, let me explain
by first going back to something simpler.
Ever since the earliest versions of Perl, you've been able
to say:
$count[3] = 14;
as the first statement of your program. What happens here is that
Perl realizes that @count doesn't exist, and creates it.
This new @count isn't long enough to hold four elements
yet (elements 0 through 3), so Perl extends the array to include those
elements, and puts the value 14 at $count[3].
Similarly, the code:
$seen{'Dino'} = 7;
has also always worked, first creating the hash %seen if it
didn't exist, then adding a hash element with a key of Dino,
and finally putting 7 as the corresponding value.
The value of this "automatic creation of variables"
is that I don't need to first predetermine all the possible
index values for a given hash or array, create the structure properly,
and then run my program. I can simply run my program, and the data
structures will expand as necessary to hold the values.
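Here is a minimal sketch of both one-liners together, so the growth can be inspected. The variables are declared with my only so that the sketch runs under use strict; the names $array_size, $last_index, and $filled are illustrative and not part of the original examples.

```perl
use strict;
use warnings;

# Declared with "my" only so the sketch runs under "use strict";
# the one-liners in the text rely on undeclared package variables.
my @count;
my %seen;

$count[3] = 14;        # array grows to hold indices 0..3
$seen{'Dino'} = 7;     # hash gains one key

my $array_size = scalar @count;            # number of elements
my $last_index = $#count;                  # highest index
my $filled     = grep { defined } @count;  # elements that hold a value
my @keys       = keys %seen;

print "size=$array_size last=$last_index filled=$filled\n";
print "keys=@keys value=$seen{'Dino'}\n";
```

Only the element that was assigned holds a value; the three elements in front of it are undef.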
Admittedly, language purists cringe at even this Perl capability,
but let's just presume that they don't understand that
the "P" in Perl stands for "Practical" (at least
in one telling of the story).
When references were added to the language in Perl 5,
Larry Wall extended the definition of this auto-extension of variables
to include not-yet-defined references, and this is the action defined
as "auto-vivification". This is best illustrated with an
example:
$myreference = undef;
$myreference->[3] = 14;
Here, $myreference is being treated as an array reference in
the second statement. However, at the moment, that's undef.
But just as Perl creates variables where necessary, and extends arrays
and hashes as needed, Perl will also plug in a pointer to an empty
anonymous array here. It's as if these statements were rewritten
as:
$myreference = undef;
$myreference = []; # inserted via auto-vivification
$myreference->[3] = 14;
Now the 14 is being inserted as the fourth element (element index
3) of the anonymous array, pointed to by the $myreference variable.
Again, this is really just a continuation of the prior behavior:
"extend the data structures as necessary so that the show can
go on". Formally, the rule is:
    If a variable containing undef is being used in an assignment
    as if it were a reference to a data structure, a reference to an empty
    data structure of the appropriate type is placed into that variable
    before the operation continues.
And the result is that we create data structures as needed. For example,
this also works:
$myreference = undef;
$myreference->{'Dino'} = 7;
Note, however, that we'll end up with a hash reference in $myreference,
not an array reference. This hash reference initially points at an
anonymous empty hash, which is then nearly immediately extended to
include an element with a key of Dino and a value of 7.
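One way to watch the rule at work is to ask Perl what kind of reference ended up in each variable. This is a small sketch using the built-in ref function; the variable names $as_array and $as_hash are made up for the illustration.

```perl
use strict;
use warnings;

my $as_array = undef;
my $as_hash  = undef;

$as_array->[3]     = 14;      # vivifies an anonymous array
$as_hash->{'Dino'} = 7;       # vivifies an anonymous hash

my $array_kind = ref $as_array;   # kind of reference now stored
my $hash_kind  = ref $as_hash;

print "$array_kind with ", scalar(@$as_array), " elements\n";
print "$hash_kind with keys: ", join(',', keys %$as_hash), "\n";
```

Both variables started out identical; only the way each one was dereferenced differs.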
The type of reference is determined by the kind of dereference
we're attempting (array or hash), not by the previous contents of the
variable. In fact, the previous contents of the variable must be
undef, or the rule given above doesn't apply. So, this
sequence is guaranteed to fail:
$myreference = undef;
$myreference->[3] = 14;
$myreference->{'Dino'} = 7; # fails
We're trying to use the now-present array reference in $myreference
as if it were a hash reference. This can't work (ignoring the
soon-to-be-removed pseudo-hash feature, anyway), and will throw a
runtime exception.
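To see the failure without killing the program, the offending statement can be wrapped in an eval block. This is only a sketch: the $ok and $error variables are illustrative, and the exact wording of the message (something like "Not a HASH reference") is what a modern Perl without pseudo-hashes reports.

```perl
use strict;
use warnings;

my $myreference = undef;
$myreference->[3] = 14;               # now holds an array reference

# eval traps the fatal error so the script can report it
my $ok = eval {
    $myreference->{'Dino'} = 7;       # wrong kind of dereference
    1;
};
my $error = $@;

print "assignment ", ($ok ? "succeeded" : "failed"), "\n";
print "error: $error";
```

The array reference is left untouched by the failed attempt, so $myreference->[3] is still 14 afterward.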
The examples above deliberately put an undef into the variable,
but the undef that is present in a newly created variable
would have worked just as well:
my $newreference;
$newreference->[3] = 14;
And recall that a new element of an array or hash also has this same
sort of undef:
my @pointers;
$pointers[42]->{'Dino'} = 7;
Here, $pointers[42] doesn't exist, so Perl first extends
the @pointers array to include that element. But then the element
is being used as if it were a hash ref, so Perl places an anonymous
hash reference into $pointers[42], and continues the operation.
If we consistently placed only hash references into this array, we'd
have a dynamically allocated array of hashrefs.
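A short sketch makes the two steps visible: the array grows first, and then one element is vivified into a hash reference. The names $size, $kind, and $neighbor are hypothetical helpers for the printout.

```perl
use strict;
use warnings;

my @pointers;
$pointers[42]->{'Dino'} = 7;   # extends the array, then vivifies a hash

my $size     = scalar @pointers;                             # indices 0..42
my $kind     = ref $pointers[42];                            # what got vivified
my $neighbor = defined $pointers[41] ? "defined" : "undef";  # untouched slot
my $value    = $pointers[42]->{'Dino'};

print "size=$size kind=$kind neighbor=$neighbor value=$value\n";
```

Only the element that was dereferenced receives a hash; the other 42 elements stay undef until something uses them.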
Of course, you can drop that arrow, because it's between
two "subscript-y kind of things" (technical terms), so
it's more commonly written as $pointers[42]{'Dino'}.
And even the quotes aren't necessary there, since the hash
key is a simple bareword identifier, so we can reduce that further
to $pointers[42]{Dino} safely.
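As a quick check that the three spellings really reach the same element, here is a sketch that reads it back each way. The variable names are invented for the comparison.

```perl
use strict;
use warnings;

my @pointers;
$pointers[42]->{'Dino'} = 7;

my $with_arrow    = $pointers[42]->{'Dino'};   # explicit arrow, quoted key
my $without_arrow = $pointers[42]{'Dino'};     # arrow dropped between subscripts
my $bareword_key  = $pointers[42]{Dino};       # quotes dropped as well

print "$with_arrow $without_arrow $bareword_key\n";
```

A key containing spaces or punctuation would still need its quotes.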
An action might invoke multiple levels of auto-vivification. For
example, let's look at the following code:
my $source = "red";
my $destination = "yellow";
my $length = 35;
$lengths{$source}{$destination} = $length;
The hash element $lengths{red} is being used as a hash reference,
de-referenced, and the element with a key of yellow of that
hash is being given the value 35. Now, if these are the first few
steps of the program, %lengths won't even exist, so it
first gets created. Then, since $lengths{red} doesn't
exist, it gets installed with a value of a reference to an empty hash
(via auto-vivification). Finally, the element with a key of yellow
in that hash is given the value of 35, and we're done. This is
more commonly encountered in a loop:
while (<DATA>) {
my ($source, $destination, $length) = split;
$lengths{$source}{$destination} = $length;
}
# more code here later
__END__
red yellow 35
red green 19
purple blue 12
blue orange 18
Note that once the first line is processed, creating a hash reference
for $lengths{red}, the second line doesn't create a new
hash reference, because $lengths{red} is already defined. So
the elements with keys of yellow and green are both in the same hash,
referenced by the hash element of $lengths{red}.
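To inspect what the loop built, the two levels can be walked with nested keys calls. In this sketch the records live in a list (@records, an assumption made so the block stands alone) rather than in the DATA section, and @report is an illustrative name.

```perl
use strict;
use warnings;

# Same records as the DATA section, held in a list for this sketch
my @records = (
    "red yellow 35",
    "red green 19",
    "purple blue 12",
    "blue orange 18",
);

my %lengths;
for (@records) {
    my ($source, $destination, $length) = split;
    $lengths{$source}{$destination} = $length;   # inner hash vivified on demand
}

# Walk both levels in sorted order to show what was built
my @report;
for my $source (sort keys %lengths) {
    for my $destination (sort keys %{ $lengths{$source} }) {
        push @report, "$source -> $destination: $lengths{$source}{$destination}";
    }
}
print "$_\n" for @report;
```

Four records produce three outer keys, because both red lines share one inner hash.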
A variant on this for tabulation purposes relies on the way an
operator like += treats a variable that is still undef.
For example, the following code sums a
list of numbers:
while (<DATA>) {
my ($number) = split;
$sum += $number;
}
print "$sum\n";
__END__
3
5
19
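The same sum can be written so that it runs under use strict and use warnings, which also shows that += applied to a still-undef variable does not trigger an "uninitialized" warning. This is a sketch: the numbers are inlined instead of read from DATA, and the @warnings collector is a hypothetical addition for the demonstration.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };   # collect any warnings

my $sum;                      # undef to start
$sum += $_ for (3, 5, 19);    # undef acts as 0 on the first pass

print "sum=$sum warnings=", scalar(@warnings), "\n";
```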
The first time through the loop, $sum is uninitialized, and
therefore guaranteed to be undef, but this happens to be the
perfect base value for +=, treating the undef like a
0 because addition is a mathematical operation. We can apply this
to a complex data reduction:
while (<DATA>) {
my ($source, $destination, $hits) = split;
$total_hits{$source}{$destination} += $hits;
}
# more code here later
__END__
red yellow 35
red green 19
red yellow 12
blue red 18
blue red 8
Just like the previous summing example, we're accumulating a
total. But now we're keeping a separate total for each pair
of source and destination. Looking at the first invocation:
$total_hits{red}{yellow} += 35;
Since %total_hits is empty at this point, Perl first extends
the hash to include a hashref at $total_hits{red}. This hashref
initially points to an empty hash, but then gets extended to include
an element at the key of yellow. However, since the value at
this key is being used in a +=, the initial undef value
is treated as 0, and then 35 gets added, resulting in 35. This 35
is then stored in place of the initial undef, and we're
done. When the third step is executed:
$total_hits{red}{yellow} += 12;
12 is added to the existing value of 35, yielding 47, and that becomes the
updated value.
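Putting the whole reduction together, here is a sketch that tallies the sample records and prints each pair's total. As before, the records are kept in a list (@records, an assumption so the block is self-contained) instead of the DATA section.

```perl
use strict;
use warnings;

# Same records as the DATA section, held in a list for this sketch
my @records = (
    "red yellow 35",
    "red green 19",
    "red yellow 12",
    "blue red 18",
    "blue red 8",
);

my %total_hits;
for (@records) {
    my ($source, $destination, $hits) = split;
    $total_hits{$source}{$destination} += $hits;   # vivify, then accumulate
}

# Print one line per source/destination pair, in sorted order
for my $source (sort keys %total_hits) {
    for my $destination (sort keys %{ $total_hits{$source} }) {
        print "$source $destination $total_hits{$source}{$destination}\n";
    }
}
```

Five input lines collapse to three totals, with red/yellow at 47 and blue/red at 26.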
The important point here is that you write what you want it to
do, and it just works. That's the nice thing about Perl. It
very often just Does The Right Thing. So, be mystified by auto-vivification
no more: learn to embrace it, use it, and like it! Until next time,
enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming, as well as writing regular columns for WebTechniques
and Unix Review magazines. He's also a frequent contributor
to the Perl newsgroups, and has moderated comp.lang.perl.announce
since its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.