Article

oct2000.tar

Little Acts of Magic

Randal L. Schwartz

So, let's start with some text manipulation. I have a poem written by a good friend of mine, Peg Edera, in a file named peg_poem, as follows:

The Little Acts

Maybe there is no magic.
Maybe it is only faith.
The lonely girl stuffs a letter
In a bottle, casts it in the sea.
It floats, it sinks.
What counts is that it's cast.

Those little intentions,
The petals placed on the altar,
The prayer whispered into air,
The deep breath.
The little acts,
The candles lit,
The incense burning
And we remember what counts
And we know what we need
In the little acts.

Then is when to listen.
What is it that brought you
To this?
There's the magic.
Right in hearing what
Sends you to
Hope.

Peg Edera
February 8, 2000

The title of her poem inspired the theme of this column, so it's only appropriate that we use the text as grist for the mill.

Let's start with the basics of opening the file and reading. That'd be something like:

  open POEM, “peg_poem” or die “Cannot open: $!”;
  while (<POEM>) {
    ... do something here with each line
  }

Within the body of this while loop, $_ contains each line from the poem. So on the first iteration, we get The Little Acts and a new line, and so on.

If we just want to copy the data to STDOUT, a simple print will do:

  # open...
  while (<POEM>) {
    print;
  }

Here, the print defaults to STDOUT, so we'll end up copying the input to the output. What if we wanted line numbers? The variable $. contains the line number of the most recently read file:

  # open...
  while (<POEM>) {
    print “$.: $_”;
  }

We get a nice display labeled by line numbers. Let's optimize this a bit -- there's too much typing for such a small thing happening in the middle of the loop:

  # open...
  print “$.: $_” while <POEM>;

Ahh yes, the while modifier form. Each line is still read into $_, and thus the print gets the right info.

Even the open can be optimized out of there, by using the cool “diamond” operator. The operator looks at the current value of @ARGV for a list of filenames, so let's give it one:

  @ARGV = qw(peg_poem);
  while (<>) {
    print;
  }

Notice we don't have to explicitly open now, because that's handled by the diamond. Of course, copying files is best done by a special purpose module for copying:

  use File::Copy;
  copy “peg_poem”, \*STDOUT;

But that's just another way of doing it.

Let's go the other direction: processing the information before sending it out. As an artist, I'm sure Peg appreciates the ability to include blank lines between the paragraphs of the poem. But how would we strip those blank lines on the output? Simple enough: use a regular expression:

  while (<>) {
    print if /\S/;
  }

Here, the regular expression is looking for any single non-whitespace character. If there aren't any of those, the line is at least blank-looking, and not worth printing.

Besides printing things as quickly as we read them, we can also read the entire file into memory for more interesting operations:

  while (<>) {
    push @data, $_;
  }

Each new line is added to the end of @data, which initially starts empty. Now we can print them out in the reverse order:

  for ($i = $#data; $i >= 0; $i--) {
    print $data[$i];
  }

And while this works (it's a normal for loop), it's actually less work for the programmer (and slightly more for Perl) to write this simply as:

  print reverse @data;

which takes the@data value and reverses a copy of it end-for-end before handing this new list off to print.

What if we wanted to reverse each string? Well, the reverse operator in a scalar context turns the string around. But then the new line is at the wrong end. So, it's a multiple-step procedure:

  foreach $line (@data) {
    chomp($copy = $line);
    print reverse($copy).”\n”;
  }

Here, I take the string, copy it into a separate variable (so that the chomp doesn't affect the original @data element), then reverse that variable's contents in a scalar context (because it's the operand of the string concatenation operator), and then dump that out.

Another way to grab the part of the string up to but not including the new line is with a regular expression:

  foreach (@data) {
    print reverse($1).”\n” if /(.*)/;
  }

In this case, I'm using the implicit $_ variable together with a regular-expression match to find all the characters that don't include a new line (because dot doesn't match a new line), and then using that as the argument to reverse. Magic!

We could also drift this towards a mapping operation, now that I look at it. Let's make a little assembly line:

  @reversed = map {
    /(.*)/ && reverse($1).”\n”;
  } @data;
  print @reversed;

The map operation takes every element of @data and temporarily places it into $_. The regular expression match always succeeds, and when it does, $1 contains the string up to but not including the new line, which then gets reversed and a new line is tacked on the end. Of course, we don't need that intermediate variable:

  print map {
    /(.*)/ && reverse($1).”\n”;
  } @data;

I think Peg would probably laugh at the output of that program applied to her work, so let's look at some other small magic items.

If we wanted to break the lines into a series of words, the easiest way is to apply a regular expression match with a “global” modifier to each line, like so:

  while (<>) {
    push @words, /(\S+)/g;
  }

Here, the regular expression of \S+ matches every contiguous chunk of non-whitespace characters. So, after the first line has been processed, we'll have:

  @words = (“The”, “Little”, “Acts”);

and the second line contributes nothing to the array, because there are no matches. We can shorten this slightly, using that map operator again:

  @words = map /(\S+)/g, <>;

First, the diamond operator on the right is being used in a list context, meaning that all the lines from all the files of @ARGV are being sucked in at once. Next, the map operator takes each element (line) from the list, and shoves it into $_. Next, the regular expression is being evaluated in a list context, and since the match can occur multiple times on a given string, each match contributes 0 or more elements to the result list. That result list then becomes the value for @words. Wow, all in one line of code.

The problem with this particular rendering is that we're sucking in the punctuation as well. Thus, magic is in the array as magic., and that's not the same word, especially if we want to count the words.

We can alter this a bit:

  @words = map /(\w+)/g, <>;

and now we're grabbing all contiguous alphanumerics-and-underscores, selected by the things that \w+ matches. But that breaks the word There's into two pieces. Bleh. There are many a long hacking sessions when I've wished that the definition for \w had included apostrophes but excluded underscores. So, it's a slightly more precise and explicit regex for me:

  @words = map /([a-zA-Z'])/g, <>;

That works for this poem, and leaves out those nasty date numbers as well. As a final bit of magic, let's see what the most frequent word is. Wait, some of them are initial caps, so we need to do one more hop:

  @words = map lc, map /([a-zA-Z'])/g, <>;

That fixes it so they're all lowercase and now let's count them:

  $count{$_}++ for @words;

Can it be that simple? It is. Each of the words ends up in $_. We use that as a key to a hash. The value initially is undef, and incremented as we see each word.

Now it's time to dump them out:

  @by_order = sort { $count{$b} <=> \
    $count{$a} } keys %count;
  for (@by_order) {
    print “$_ => $count{$_}\n”;
  }

That dumps them out in the sorted order. We're using a sort block to control the ordering of the keys, then dumping them in that order. I hope you see that Perl can be a little magic at times. It's the little things that count. Until next time, enjoy!

About the Author

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the “must-have” standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.