Cover V10, I04
apr2001.tar


A MIME Is a Terrible Thing to Waste

Randal L. Schwartz

The Multipurpose Internet Mail Extensions (MIME) standard has been around for nearly a decade, but has only recently become popular. This is probably because of the higher-bandwidth data connections now available for email, the advent of the Web, and the desktop horsepower needed to make things that are fancier than plain text (or should that be text/plain?).

MIME is both a blessing and a curse, in my opinion. It's cool that I can send a PDF or a JPEG to a friend as an attachment, and know that I don't need to figure out whether they have a uu-decoder or a shell to extract from a sharchive. It's bad, however, because a lot of mail that is really just plain old text is being sent as HTML mail or the very popular "multipart/alternative" mail.

Why is this bad? Well, for one, I don't think Tim Berners-Lee or any of the chaps involved with the creation of the Web envisioned HTML as a medium for email. HTML is about hyperlinks and structured text, readable in an interactive environment. Email is a simple message, usually conversational, and generally has no need for markup or links.

So most of the use of HTML mail these days is by the "push advertisers", or as we more often call them, "spammers". It's a great way to shove a flashy, sizzly, no-content ad for fax paper or a trip to Central America into our email boxes, with enough bouncy clicky things that we'll probably respond.

A more serious problem with HTML email is that it's a great carrier of JavaScript viruses. Countless times I've read about people getting nailed because of code embedded in HTML email. Thus, it's a security threat to organizations.

That's why I think mail should always be plain text, unless both parties agree otherwise. Go ahead, shoot me, but there's my opinion.

Apparently, my opinion is not shared by the makers of some of the so-called mail clients, like Outlook Express or Netscape Communicator. Out of the box, every message is sent as the multipart/alternative MIME type, with a text version and an HTML version. Theoretically, if you have a MIME-savvy mail client, you see such mail in a nice HTML-formatted window. If not, you get gibberish for the second half of your text screen. And they call that communicating.

Sure, you can turn it off. Perhaps. But read on, and you'll see where this is going.

Now, here's the problem. I run a low-volume mailing list for a management class I'm taking... nothing fancy... just a rebroadcaster called from procmail. I started to see a lot of these HTML forked messages, and got annoyed when some of the replies also quoted part of the MIME wrapper markup, making it hopeless to read in any normal sense.

So I put a filter in the mail forwarder to kick back anything that included either boundary or html in the content-type mail header... a sure sign that someone was sending something other than plain text. Yes, right after inserting that filter, the worst offenders were unable to use my mailing list until they figured out how to turn that HTML fork off, and then all was good.

In this most recent group of users, we had a couple of people who had installed Outlook with Windows 2000 (not Outlook Express). Even after I had called in favors from my friends who understand Redmond-ware better than I do, they still couldn't figure out how to turn off the durn HTML.

So what to do? I wasn't about to relax my policy, having been very happy with the result achieved with the previous group. And one of them had started painfully copying all the addresses directly into their address book, a mess for maintenance, and trouble for the Web-based archive for the mailing list.

Then I thought, "Hey, all I need is a small Perl filter that recognizes this so-called email and strips the HTML fork!" And that's what I decided to build.

Luckily, we've got the very nice MIME::Tools package in the CPAN to do most of the hard work, although I admit it took me a few false starts to get the project done.

First, let's hack out some code to take a brain-damaged email on standard input, writing out a clean piece of email on standard output (untouched if it's not the right format). We'll start with three lines that begin nearly every program I write:

#!/usr/bin/perl -w
use strict;
$|++;
This enables warnings, turns on the compiler restrictions (no symbolic references, undeclared variables, or barewords), and unbuffers standard output. Next, we grab the "envelope-from" from the input:

my $envelope = <STDIN>;
This "envelope-from" looks like:

From merlyn  Wed Jan 24 11:37:17 2001
and tells the next mailer where this mail came from. It's not actually in the shape of an RFC 822 header, because it's a "meta-header", and therefore shouldn't be parsed along with the rest of the MIME information. We'll grab it here, and print it back out when we're done.
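
To make the difference between the envelope line and a real header concrete, here is a minimal sketch. The helper name looks_like_envelope and its pattern are assumptions made up for illustration and are not part of the filter itself; the only point is that the envelope line starts with "From " (a space, no colon), which is what keeps it from looking like a From: header.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper (not part of the filter): true if a line has the
# shape of an mbox-style "From " envelope line rather than a From: header.
sub looks_like_envelope {
    my ($line) = @_;
    return 0 unless defined $line;
    # "From", one space, a sender token, whitespace, then the start of a date.
    return $line =~ /^From \S+\s+\S/ ? 1 : 0;
}

# The sample envelope line from the text, and a made-up header for contrast.
for my $line ("From merlyn  Wed Jan 24 11:37:17 2001\n",
              "From: merlyn\@example.com\n") {
    print looks_like_envelope($line) ? "envelope: " : "header:   ", $line;
}
```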

Next, we'll pull in two of the modules from the MIME::Tools distribution:

use MIME::Parser;
use MIME::Entity;
And then we'll create a MIME::Parser object to read the input:

my $parser = MIME::Parser->new;
$parser->output_to_core(1);
$parser->tmp_to_core(1);
Here, I'm creating a MIME parser that keeps everything in core, including any temporary files. Of course, this will break down if someone sends me a 200-MB AVI file, but I can catch that at the step before this anyway.

Now it's time to read standard input:

my $ent = $parser->parse(\*STDIN);
The $parser object reads the email message from standard input into memory. If there's any failure here (bad input, bad format), the parser will die. We'll invoke this program in such a way that if it fails for any reason, the original message is kept, so the death is not an issue.

Now for the cool part. I can use the methods available on the message (a MIME::Entity object) to probe into the structure. One of the first things I did was simply to replace the rest of the program with:

$ent->dump_skeleton(\*STDERR); exit 1;
This caused the program to show the structure of the message, so I could figure out what an HTML-forked mail message looks like, compared to everything else. After I ran that on a few sample messages, I removed that line and replaced it with this:
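
If you don't have a forked message handy to experiment with, one way to get something similar is to build a stand-in with MIME::Entity and dump that. This is only a sketch: the addresses, subject, and body text are made up, and a message produced by a real mail client may carry extra headers or parts.

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Entity;

# Build a stand-in for an HTML-forked message (all values are placeholders).
my $sample = MIME::Entity->build(
    Type    => 'multipart/alternative',
    From    => 'someone@example.com',
    To      => 'list@example.com',
    Subject => 'Test of the HTML fork',
);

# The plain text alternative comes first, then the HTML one.
$sample->attach(Type => 'text/plain', Data => "Hello, list.\n");
$sample->attach(Type => 'text/html',
                Data => "<html><body><p>Hello, list.</p></body></html>\n");

# Show the nesting of the parts and their content types.
$sample->dump_skeleton(\*STDOUT);
```

The dump lists the root entity followed by each part, which is enough to see the two-alternative shape that the test in the next step looks for.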

if ($ent->effective_type eq "multipart/alternative"
    and $ent->parts == 2
    and $ent->parts(0)->effective_type eq "text/plain"
    and $ent->parts(1)->effective_type eq "text/html") {
Whoa. Lots of stuff here. Let's go slow. First, I'm seeing if the top-level structure is a multipart/alternative. A MIME document is hierarchically structured (attachments can have attachments, and so on), so we're looking at the root here. If that's good, then we also make sure there are two alternatives, and that the first one is a plain text entry, and the second one is HTML. If so, it's likely to be the evilness that I'm trying to fix. (There's a very small chance that the text and HTML parts are radically different and unrelated, but if so, it's mistagged as multipart/alternative rather than the more proper multipart/mixed type.)
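
Because the test looks only at the root, a fork buried inside another multipart is left alone. The sketch below wraps the same four conditions in a helper and tries it on two hand-built messages. The names is_html_fork and make_fork are hypothetical, and the message contents are placeholders.

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Entity;

# Hypothetical helper: the same four conditions, applied to the root entity.
sub is_html_fork {
    my ($ent) = @_;
    my $ok = ($ent->effective_type eq "multipart/alternative"
        and $ent->parts == 2
        and $ent->parts(0)->effective_type eq "text/plain"
        and $ent->parts(1)->effective_type eq "text/html");
    return $ok ? 1 : 0;
}

# Hypothetical helper: build a text+HTML alternative. Pass 0 for $top when
# the result will be nested inside another entity.
sub make_fork {
    my ($top) = @_;
    my $alt = MIME::Entity->build(Type => 'multipart/alternative', Top => $top);
    $alt->attach(Type => 'text/plain', Data => "Hello.\n");
    $alt->attach(Type => 'text/html',  Data => "<p>Hello.</p>\n");
    return $alt;
}

# Case 1: the fork is the whole message.
my $forked = make_fork(1);

# Case 2: the fork is one part of a multipart/mixed message.
my $mixed = MIME::Entity->build(Type => 'multipart/mixed');
$mixed->add_part(make_fork(0));
$mixed->attach(Type => 'text/plain', Data => "A second, unrelated part.\n");

printf "%-26s %s\n", "fork at the root:",
    is_html_fork($forked) ? "match" : "no match";
printf "%-26s %s\n", "fork nested under mixed:",
    is_html_fork($mixed) ? "match" : "no match";
```

Leaving the nested case untouched is a conservative choice: a message that also carries a real attachment goes on to the later rules unchanged.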

So the next step is to extract the text part as its own entity, and then hoist that part to become the entire message. There may be an easier way of doing this, but here's what I did. First, make a new entity from the body of the old text one:

my $newent = MIME::Entity->build(Data =>
                      $ent->parts(0)->body_as_string .
                      "\n\n[[HTML alternate version deleted]]\n");
Notice that I added a little message on the end to let people know magic has happened. I could have also inserted it into the mail header instead, but I wanted it to be prominent.

Next, we toss all the parts except for this one:

$ent->parts([$newent]);
And then, we fold it from a multipart document to a single-part document (where MIME is not even mentioned, and we have no boundary markers):

$ent->make_singlepart;
And finally, some of the headers are now out of sync, so it's time to clean them up as best we can:

  $ent->sync_headers(Length => 'COMPUTE', Nonstandard => 'ERASE');
}
And that's it. If it met the ugly-message criteria, we now have a new message in $ent; otherwise, we have the original. Time to dump it out. First the envelope:

print $envelope;
And now the message itself:

$ent->print;
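
To try the whole transformation without wiring it into a mail system, the steps above can be run against a message held in a string. The sketch below does that with the parser's parse_data method in place of reading STDIN. The sample message, its addresses, and the boundary string are invented for the test, and the envelope line is left out because there is none to preserve here.

```perl
#!/usr/bin/perl -w
use strict;
use MIME::Parser;
use MIME::Entity;

# A made-up HTML-forked message; addresses and boundary are placeholders.
my $raw = join "\n",
    'From: someone@example.com',
    'To: list@example.com',
    'Subject: Test',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XYZ"',
    '',
    '--XYZ',
    'Content-Type: text/plain',
    '',
    'Hello, list.',
    '--XYZ',
    'Content-Type: text/html',
    '',
    '<html><body><p>Hello, list.</p></body></html>',
    '--XYZ--',
    '';

# Same in-core parser setup as in the filter, fed from a string.
my $parser = MIME::Parser->new;
$parser->output_to_core(1);
$parser->tmp_to_core(1);
my $ent = $parser->parse_data($raw);

# The same test and the same hoisting steps as in the filter.
if ($ent->effective_type eq "multipart/alternative"
    and $ent->parts == 2
    and $ent->parts(0)->effective_type eq "text/plain"
    and $ent->parts(1)->effective_type eq "text/html") {
    my $newent = MIME::Entity->build(Data =>
        $ent->parts(0)->body_as_string .
        "\n\n[[HTML alternate version deleted]]\n");
    $ent->parts([$newent]);
    $ent->make_singlepart;
    $ent->sync_headers(Length => 'COMPUTE', Nonstandard => 'ERASE');
}

# Print the result so the headers and body can be inspected by eye.
$ent->print(\*STDOUT);
```

Running it on a message that isn't forked (for example, by changing the Content-Type of the second part) is a quick way to confirm that other mail passes through untouched.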
The next step was to hook it into procmail delivery for the mailing list. Ahead of the step that does the actual sending, I added one additional rule:

:0 fw
* ^Content-type:.*boundary
| $HOME/lib/Strip-HTML-fork
where $HOME/lib/Strip-HTML-fork contains the program above. If the filter is able to do its magic, then the next procmail rule starting with:

:0
* ^Content-type:.*(html|boundary)
{
  .. bouncing logic not shown ..
}
no longer triggers, and the mail goes through! Success.

Well, I hope I've convinced you that a MIME is a terrible thing to waste, but once wasted, we can fight back properly. Until next time, enjoy!

Randal L. Schwartz is a two-decade veteran of the software industry -- skilled in software design, system administration, security, technical writing, and training. He has coauthored the "must-have" standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.