Taint so Easy, Is It?
 Randal L. Schwartz
 
If you've been reading my columns for any length of time, you've probably seen me mention taint mode, usually briefly while I'm describing a hash-bang line such as: 
 
#!/usr/bin/perl -Tw
which turns on warnings (the -w) and taint mode (the -T). But what is taint mode? 
Taint mode is a security feature of Perl and includes two levels of operation. First, while taint mode is in effect, some operations are forbidden. For example, $ENV{PATH} cannot contain any world-writable directories when firing off a child process (as with backticks or system). Should your program attempt an unsafe action, the program aborts immediately (via die), before the action has a chance to create a potential security violation. You could have included code to check this yourself, but having Perl perform the checks ensures a consistency and a best-practices level of competence that you may not have the capability or resources to include explicitly. 
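As a minimal sketch of satisfying that first level of checks, a program can give itself a known search path before starting any child process. The helper name sanitize_env and the directory list /bin:/usr/bin are assumptions for illustration; the extra variables deleted here are the ones the perlsec manpage suggests removing.

```perl
use strict;
use warnings;

# sanitize_env is a hypothetical helper name, and /bin:/usr/bin is an
# assumed safe path; adjust the directories for your own system.
sub sanitize_env {
    # Assigning a literal gives PATH a value that did not come from outside.
    $ENV{PATH} = '/bin:/usr/bin';
    # These variables can also influence how a shell runs a command.
    delete @ENV{qw(IFS CDPATH ENV BASH_ENV)};
    return;
}

sanitize_env();
print "PATH is now $ENV{PATH}\n";
```

Calling something like this early in the program, before any backticks or system, is one way to keep the PATH check from tripping.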
The second level of operation is much more interesting and unique to Perl (amongst all the popular languages I know of): Perl keeps track of whether each scalar value in the program is to be distrusted. Every item of data coming from input sources (command-line arguments, environment variables, locale information, some system calls, and all file input) is marked tainted. 
For example, the following operations all generate tainted data: 
 
$t1 = <STDIN>;
$t2 = $ENV{USER};
$t3 = $ARGV[2];
@t4 = <*.txt>;
In each of these examples, the data has come from the outside world, and is therefore treated as potentially dangerous. Once data is tainted, the taint propagates to any data derived from the tainted data:
 
$t5 = $t4[0];
$t6 = "/home/$t2";
chomp($t1);
@x = ("help", "me", $t3, "please");
Note that tainting is on a per-scalar basis. So $x[2] is tainted, not the entire array @x.
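To see the per-scalar behavior for yourself, the standard Scalar::Util module provides a tainted function that reports whether a value is currently tainted. The following is a small sketch, not a definitive tool: describe_taint is a hypothetical helper name, and $ENV{PATH} simply stands in for any value from the outside world.

```perl
use strict;
use warnings;
use Scalar::Util qw(tainted);

# describe_taint is a hypothetical helper: it labels each argument
# according to what Scalar::Util::tainted reports for it.
sub describe_taint {
    return map { tainted($_) ? 'tainted' : 'clean' } @_;
}

# $ENV{PATH} stands in for any outside value; fall back to an empty
# string if it is unset so the example stays quiet under warnings.
my $outside = defined $ENV{PATH} ? $ENV{PATH} : '';
my @x = ('help', 'me', $outside, 'please');

print join(' ', describe_taint(@x)), "\n";
```

Run with perl -T and with PATH set, this should report only the third element as tainted. Without -T, nothing is tainted, so every element is reported as clean.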
Once data is marked tainted, nearly any attempt to use the data to affect the outside world will be blocked, causing an immediate die with a taint violation. For example, invoking rename, where either the source name or destination name is tainted, is considered dangerous. This permits normal operations: 
 
rename $x[0], $x[1];
But not operations that involve tainted data (recall that $x[2] is tainted from earlier): 
 
rename $x[0], $x[2];
This means that data coming in from the outside world cannot trivially affect the outside world as well. Why is this important? 
Well, the typical use of taint mode is to enable programs that act on behalf of other users to operate in a safer manner. For example, a setuid or setgid program borrows the privileges of its owner for the duration of execution, allowing an ordinary user to act as root (or some other user) for a selected set of operations. Or a CGI program, executing as the Web server ID (typically nobody), is acting with that user's privileges on behalf of a request from any Web client, generally without direct access to the server except through the Web server. 
In both of these cases, it's important that input data be checked, so that the user who invokes the program cannot borrow the privileges of the executing user ID to perform unintended actions. 
For example, it'd be pretty dangerous to rename a file based on the input from a CGI form: 
 
use CGI qw(param);
...
my $source = param('source');
my $dest = param('destination');
rename $source, $dest;
Now perhaps the author of this CGI script believed that this would be a safe program because the form contained only clearly defined radio buttons or pop-up menus. But in reality, a person with intent to damage or break in could just as easily invoke this script passing arbitrary data in source and destination, and potentially rename any file to which the Web user ID has access!
With taint mode enabled, the CGI parameters (having been derived from either reading STDIN or an environment variable) are marked tainted, and therefore the rename operation would fail before it has committed potential damage. (To enable taint mode on a CGI script, just include -T in the #! line, as shown earlier.) And that's exactly the safest thing to do here. 
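If you would rather report the failure than simply abort, one possible approach is to wrap the risky operation in eval and look at $@ afterward. This is only a sketch: try_rename and its status strings are hypothetical, and command-line arguments are used here in place of CGI parameters because they are tainted in the same way.

```perl
use strict;
use warnings;

# try_rename is a hypothetical helper. It attempts a rename and reports
# whether Perl refused it (died) or the system call itself failed.
sub try_rename {
    my ($from, $to) = @_;
    my $renamed;
    my $survived = eval { $renamed = rename($from, $to); 1 };
    return "refused: $@" unless $survived;    # for example, a taint violation
    return $renamed ? 'renamed' : "failed: $!";
}

# Command-line arguments count as outside data. The fallback names are
# placeholders for files that are not expected to exist.
my $from = defined $ARGV[0] ? $ARGV[0] : 'taint-demo-no-such-file-a';
my $to   = defined $ARGV[1] ? $ARGV[1] : 'taint-demo-no-such-file-b';
print try_rename($from, $to), "\n";
```

Under perl -T with two arguments supplied, the eval should catch the taint violation and the rename is never attempted. With no arguments, the literal fallback names are untainted, so the rename is attempted and fails only because the file is missing.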
But obviously, there are times when input data must in fact legitimately affect the outside world. Here's where the next feature of taint mode comes in. As a sole exception, the results of a regular expression memory reference (usually accessed as the numeric variables like $1 and $2 and so on) are never tainted, even though the match may have been performed on tainted data. This gives us the carefully guarded gate in the fence, when used properly. For example: 
 
my $source = param('source');
unless ($source =~ /^(gilligan|skipper|professor)$/) {
  die "unexpected source $source\n";
}
$safe_source = $1;
Here, $source is expected to be one of gilligan, skipper, or professor. If not, we'll die before executing the next statement, which copies the captured memory into $safe_source. (Note that the parens in the regular expression match are performing double duty, needed for both proper precedence regarding the vertical bar and the beginning and ending of string anchors, as well as having the side-effect of setting up the first backreference memory. Sometimes, you get lucky.)
The value of $safe_source can now legitimately be used in the rename operation earlier, as it came from a regular expression memory, and not directly from input data. In fact, we could even have assigned it back over $source (a common thing to do): 
 
$source = $1; # source now untainted
Of course, we'd have to perform a similar operation on $dest to complete the operation. 
So, if someone attempts to give us an incorrect value for the source parameter, like ginger, the program aborts. Certainly, this program would have aborted with or without taint mode, but in taint mode it works only because we added the extra code to perform a regular expression match, during which we needed to think about what the possible legal values for the string might have been. 
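Since the same check is needed for both the source and the destination, it may be convenient to wrap it in a subroutine. The sketch below assumes both parameters share the same three legal values; untaint_castaway is a hypothetical name, and plain strings stand in for the CGI parameters.

```perl
use strict;
use warnings;

# untaint_castaway is a hypothetical helper: it returns the captured
# (and therefore untainted) name, or dies if the value is not allowed.
sub untaint_castaway {
    my ($value) = @_;
    die "missing value\n" unless defined $value;
    $value =~ /^(gilligan|skipper|professor)$/
        or die "unexpected value $value\n";
    my $name = $1;    # copy the capture before anything else can match
    return $name;
}

for my $candidate (qw(skipper ginger)) {
    my $name = eval { untaint_castaway($candidate) };
    print defined $name ? "accepted $name\n" : "rejected $candidate\n";
}
```

Here skipper is accepted and ginger is rejected. In the CGI script, each of param('source') and param('destination') would be passed through the helper before being handed to rename.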
And that brings up the next point: we typically can't perform an explicit match against a known list of values. More often, the data is a user-specified value that needs to fit a general description. But again, regular expressions are pretty good at matching many things. 
So, let's say the $source there came from a text field box, rather than a pop-up menu, permitting an arbitrary string. How do we pass that along to the rename operator? First, we must decide what a legitimate string might be. For example, let's restrict to filenames that contain only \w-matching characters, including a dot (as long as the dot is not the first character). That'd be like this: 
 
$source = param('source');
$source =~ /^(\w[\w.]*)$/ or die;
$source = $1;
Once again, if the string is not as expected, we die. And only if we haven't died will we continue on to use $1, which has now been verified to be a name of the form that we expect.
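To get a feel for what that pattern lets through, here is a small sketch that runs it against a few candidate names. The helper name untaint_filename and the sample names are made up for illustration.

```perl
use strict;
use warnings;

# untaint_filename is a hypothetical helper wrapping the pattern above:
# one \w character, then any mix of \w characters and dots.
sub untaint_filename {
    my ($name) = @_;
    die "missing filename\n" unless defined $name;
    $name =~ /^(\w[\w.]*)$/ or die "bad filename\n";
    my $clean = $1;    # only reached after a successful match
    return $clean;
}

for my $candidate ('notes.txt', '.profile', '../etc/passwd', 'a b') {
    my $clean = eval { untaint_filename($candidate) };
    printf "%-15s %s\n", $candidate, defined $clean ? 'accepted' : 'rejected';
}
```

Only notes.txt is accepted. The leading dot, the slashes, and the space each cause the match to fail, so those names never reach $1.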
Note that it's very important to test the result of the regular expression match, because $1 (and the other memory variables) is set only when you have a successful regular expression match. Otherwise, you get the value left over from an earlier match, and that's definitely bad news: 
 
## bad code do not use ##
param('source') =~ /^(\w[\w.]*)$/;
$source = $1;
## bad code do not use ##
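The danger of the leftover value is easy to demonstrate. In this sketch, capture_without_check is a hypothetical helper that deliberately repeats the mistake, with two plain strings standing in for an earlier, legitimate match and a later, hostile one.

```perl
use strict;
use warnings;

# capture_without_check is a hypothetical helper that repeats the
# mistake: neither match result is tested before $1 is read.
sub capture_without_check {
    my ($earlier, $later) = @_;
    $earlier =~ /^(\w[\w.]*)$/;    # succeeds here, setting $1
    $later   =~ /^(\w[\w.]*)$/;    # fails here, so $1 is left alone
    my $captured = $1;
    return $captured;
}

print capture_without_check('gilligan', '../etc/passwd'), "\n";
```

This prints gilligan, which has nothing to do with the second string. The program would go on to use a value it never validated for the current input.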
A slightly more compact way of writing this correctly might be:
 
my ($source) = param('source') =~ /^(\w[\w.]*)$/
  or die "bad source";
Here, I'm using $1 implicitly as the list context result of the regular expression match, and declaring the variable that will hold it, and checking for errors, all in one compact statement.
The regular expression pattern should be as restrictive as you can get. For example, if you use something like /(.*)/s, you've effectively removed any of the benefits of taint mode for that particular data, making it potentially possible for someone to hijack your program in unintended ways. 
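A short sketch of the difference, using two hypothetical helpers: launder_anything applies the catch-all pattern, and untaint_strict applies the filename pattern from earlier. Both return undef when the match fails.

```perl
use strict;
use warnings;

# launder_anything is a hypothetical helper showing the anti-pattern:
# /(.*)/s matches every string, so nothing is ever rejected.
sub launder_anything {
    my ($value) = @_;
    my ($out) = $value =~ /(.*)/s;
    return $out;
}

# untaint_strict is a hypothetical helper using the restrictive pattern.
sub untaint_strict {
    my ($value) = @_;
    my ($out) = $value =~ /^(\w[\w.]*)$/;
    return $out;    # undef when the match fails
}

my $hostile = '../../etc/passwd';
print 'permissive: ', (defined(launder_anything($hostile)) ? 'accepted' : 'rejected'), "\n";
print 'strict:     ', (defined(untaint_strict($hostile))   ? 'accepted' : 'rejected'), "\n";
```

The permissive helper hands the hostile string back unchanged (and, in taint mode, untainted), while the strict one rejects it.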
So, I hope this gives you a bit of insight into how to use taint mode, and why it is useful. If this column taint enough for you, I suggest you check out the perlsec manpage (perhaps using the command perldoc perlsec at a prompt). Until next time, enjoy your new security knowledge.  
 
 About the Author
Randal L. Schwartz is a two-decade veteran of the software industry, skilled in software design, system administration, security, technical writing, and training. He has coauthored the must-have standards: Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as writing regular columns for WebTechniques and Unix Review magazines. He's also a frequent contributor to the Perl newsgroups, and has moderated comp.lang.perl.announce since its inception. Randal's desire to give back to the Perl community inspired him to help create and provide initial funding for The Perl Institute (perl.org). He is also a founding board member of the Perl Mongers (pm.org), the worldwide association of Perl user groups. Since 1985, Randal has owned and operated Stonehenge Consulting Services, Inc.
 