Using Regular Expressions
Larry Reznick
Regular expressions, which provide a means of representing
string
patterns for searches, are supported by most of the
common UNIX utilities,
yet many system administrators do not know how to use
them. Since
regular expressions can be used in combination with
the existing UNIX
editors and utilities to simplify a number of important
tasks, it's
worthwhile to learn to work with them.
Possibly the hardest part of mastering regular expressions
is understanding
their meaning. Learning what each individual character
means is simple
enough, but deciphering a particular regular expression
filled with cryptic,
write-only symbols seems to be more than most people
want to do. (Sometimes,
it seems as if guru-hood should be bestowed on you if
you can only
figure out what those funny-looking characters are doing.)
The easiest way I know of to interpret a complex regular
expression
is to consider each character as a separate command
in combination
with the commands that come before it. In other words,
don't worry
so much about the whole thing, but only about one character
at a time.
And, since regular expressions are used in string searching
commands,
always start out with the words "Find a string
composed of ..."
For instance, say that you encounter the following regular
expression
in a sed command:
s/<tab><tab>*/<tab>/g
(where each <tab> stands for a literal tab character: two before the * and
one between the second and third /'s). The s is the
sed substitute command, and the first slash specifies
the beginning
of the search string. (It also serves as the delimiter
between the
search and replace strings; if a slash is to be searched
for, any
other character may be used for the delimiter.) To the
star, then,
the expression reads "Find a string composed of
a tab followed
by another tab." The star signifies "zero
or more of the previous
character"; so we have "Find a string composed
of a tab followed
by another tab, zero or more, and replace it with a
single tab."
The trailing g tells sed to apply this globally throughout
the line. Without it, the replacement would apply only
to the first
such match found in the line -- any others in the same
line would
be left alone.
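As a rough illustration of the substitution just described, the sketch below keeps a literal tab in a shell variable so the pattern stays readable. The names tab, collapse_tabs, and collapse_first_run, and the sample line, are made up for this example and are not part of the original command.

```shell
# Keep a literal tab in a variable so the pattern is easy to read.
tab=$(printf '\t')

# "Find a string composed of a tab followed by another tab, zero or
# more, and replace it with a single tab," everywhere on the line.
collapse_tabs() {
    sed "s/${tab}${tab}*/${tab}/g"
}

# The same pattern without the trailing g: only the first run of tabs
# on each line is collapsed.
collapse_first_run() {
    sed "s/${tab}${tab}*/${tab}/"
}

# Hypothetical sample line; cat -tve displays each tab as ^I.
printf 'a\t\t\tb\t\tc\n' | collapse_tabs | cat -tve
printf 'a\t\t\tb\t\tc\n' | collapse_first_run | cat -tve
```

Comparing the two output lines shows the effect of the g flag: the second run of tabs survives when it is omitted.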
The same syntax -- but with the :s command -- could
be
used within a single file by the ex editor underneath
vi,
but this would apply only to that one file, while sed
could
be made to apply to many files. Another approach to
the same problem
would be to use awk, which adds a metacharacter that
sed and
ex do not understand, the + symbol, which means "one
or more." To get sed to do "one or more,"
the first
appearance of the tab character had to be explicitly
typed, and then
the second one had to be specified with the star. If
the star had
been applied to the one-and-only tab, the "zero
or more" definition
would have caused a substitution whenever a tab was
not found as well
as when it was! (This would cause every character to
have a tab placed
after it -- try it sometime, but do not save the result.
Pipe the
output through cat -tve and the tab characters will
appear
as ^I symbols.) With awk, though, you could specify
the match
by entering the + after a single tab, which would say,
"Find
a string composed of a tab, one or more." This
kind of replacement
is better done with sed than awk, though, because sed
automatically outputs everything that does not match
as well as the
results of the replacement when a match is found, while
awk would
have to be explicitly told what to do with the non-matching
lines
as well as the matching lines.
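For comparison, here is a minimal sketch of the awk approach, assuming an awk that provides gsub (nawk or any POSIX awk). The function name collapse_tabs_awk and the sample input are invented for illustration.

```shell
# "Find a string composed of a tab, one or more," and replace it with
# a single tab. The explicit print is what passes every line along,
# whether or not it contained a match.
collapse_tabs_awk() {
    awk '{ gsub(/\t+/, "\t"); print }'
}

# Hypothetical sample lines; cat -tve displays each tab as ^I.
printf 'a\t\t\tb\t\tc\n' | collapse_tabs_awk | cat -tve
printf 'no tabs here\n' | collapse_tabs_awk
```

Leaving out the print would make awk discard all of its input, which is the extra bookkeeping that sed handles automatically.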
An example of a more complex set of regular expressions
can be found
in the sending of man pages to the printer. In SVR4
and SCO's
current version 4 of their SVR3, the man command now
outputs
the actual characters for creating boldface and underlined
characters.
The boldface is done by backspacing and overstriking
the same character
several times before moving on to the next character,
while the underlining
is done by writing the underscore character, then backspacing
and
writing the actual character. On a fast terminal this
can be nice
to read (although terminal handlers usually will not
show these unless
the output is piped through the /usr/ucb/ul program,
which
adds ANSI escape sequences to use the various modes
the terminal can
produce); but on a slow terminal, such as over a modem
dialup line
or, even worse, on the printer, this can make things
agonizingly slow.
Regular expressions make it easy to prevent the command
from outputting
the boldface and underline characters. Notice that the
character preceding
the backspace will either be the underscore or the first
instance
of a repeated character for bolding. The character following
the backspace
is the only one wanted for the output (unless it happens
to be another
of the bolding characters, but if it is, it will be
followed by yet
another backspace). So, the trick is to eliminate any
character followed
immediately by the backspace, as well as the backspace
itself:
man whatever |
sed 's/.\^H//g'
where "whatever" represents the man page
to be filtered by the sed command, and the ^H represents
a backspace keystroke. Again, the sed substitute command
is
used. The regular expression says, "Find a string
composed of
any character followed by a backspace." The backslash
(\) before
the backspace character escapes the backspace so that
it will be interpreted
as a backspace character, not the usual backspacing
action that your
keyboard filter might perform. Any string that matches
this gets replaced
by nothing, which deletes it from the output. This operation
is done
globally throughout the input line, and, since sed acts
on
all the lines input, it will be performed throughout
the file. Since
the output automatically goes to the standard output,
if you want
to see the man page on the screen, simply pipe it through
your
favorite pager. If it should be printed, pipe this to
the print spooler.
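One way to try this without typing a control character is to let printf generate the backspace. This is only a sketch: the names bs and strip_overstrike and the sample line are assumptions made for the example, and the sample imitates the overstriking described above rather than real man output.

```shell
# Hold a literal backspace character in a variable.
bs=$(printf '\b')

# "Find a string composed of any character followed by a backspace"
# and replace it with nothing, globally.
strip_overstrike() {
    sed "s/.${bs}//g"
}

# Hypothetical sample: a bold N (N, backspace, N, backspace, N) and
# the word "file" underlined (underscore, backspace, letter).
printf 'N\bN\bNAME _\bf_\bi_\bl_\be\n' | strip_overstrike
```

In real use, the function would sit between man and the pager or print spooler, exactly where the sed command appears in the pipeline above.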
But, if ANSI escape sequences are built into the output,
say because
you have set your PAGER variable to automatically route
the output
of man through the /usr/ucb/ul program, how do you get
rid of those when you want to pipe the output to the
printer? Most
of the ANSI escape sequences are of the form
ESC [ params char
where ESC stands for the escape character, which
appears as ^[ to cat -tve; params indicates an
optional number with multiple numbers separated by a
semicolon (;);
and char refers to some alphabetic or punctuation symbol
representing
the particular ANSI code action to be performed.
You must use regular expressions to deal with this because
the params
and the char could be almost anything, and the params
might even be nonexistent due to reasonable default
values. Begin
with "Find a string composed of an ESC followed
by a bracket,"
which would be \^[\[ (the ^[ is a representation of
the escape character, which the backslash causes to
be uninterpreted
by the keyboard handler; the bracket itself must be
escaped since
it is a regular expression metacharacter that will function
here as
a normal character).
To represent the optional digits, use [0-9]*, which
says, "any
of the characters in the range 0 to 9, zero or more."
The bracket
characters delimit a set of characters to be treated
as a single regular
expression character (any of the set may be matched),
so the star
applies to all of those in the set. This will match
any number, no
matter how many digits there are, yet because of the
"zero or
more" interpretation of the star, the case where
no digits are
found will also match. Remember, too, that multiple
numbers could
occur, such as 123;456;789, so you must include the
semicolon
with the digits, thus [0-9;]* becomes the correct subexpression.
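To see what [0-9;]* actually picks up, the following sketch replaces the escape, the bracket, and the parameters with the parameters wrapped in braces. The names esc and show_params and the sample line are hypothetical; the escape character is produced with printf instead of being typed.

```shell
# Hold a literal escape character in a variable.
esc=$(printf '\033')

# Capture whatever [0-9;]* matched and show it inside braces, so the
# three cases (one number, several numbers, no number) are visible.
show_params() {
    sed "s/${esc}\[\([0-9;]*\)/{\1}/g"
}

# Hypothetical sample with the parameters "1", "0;4", and none at all.
printf '\033[1mbold\033[0;4m under\033[m\n' | show_params
```

The empty braces in the output come from the sequence with no parameters, where the star matched zero characters.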
Finally, any upper-case alphabetic character, many of
the lower-case
characters, and two of the punctuation marks (specifically,
@
and `) might follow the optional number, and in a few
cases, a
single space might precede the character. These characters
identify
exactly which control function is to be used. The ANSI
and ISO committees
specified that any of the characters between 40 hex
and 6F
hex inclusive (except for those between 5B hex and 5F
hex inclusive) may be used without the space, and any
between 40
hex and 52 hex, inclusive (except for those between
4A
hex and 4D hex inclusive) may be used with the space.
We
probably do not have to get quite that picky and could
simply represent
this as [ @-o], which says, "any of either the
space character
or the characters ranging from @ to o."
The problem with this formulation is that, if the space
matches, it
will be followed by another character, while if it does
not match,
the other characters are sufficient to complete the
entire match.
As a result, the expression completes even if nothing
but a space
comes up. To avoid this, we might write instead " *[@-o]" (a space
followed by a star, then the bracketed set),
which says, "a space, zero or more, followed by
any of the characters
@ to o."
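The difference between putting the space inside the set and putting it in front of the set can be checked with a small experiment. This is a sketch under stated assumptions: the function names with_set and with_star and the sample sequence are invented, and LC_ALL=C is set so that the range @-o is read in plain ASCII order.

```shell
# Hold a literal escape character in a variable.
esc=$(printf '\033')

# Space inside the set: the match may end at the space itself.
with_set() {
    LC_ALL=C sed "s/${esc}\[[0-9;]*[ @-o]//g"
}

# Space, zero or more, followed by one character from @ to o.
with_star() {
    LC_ALL=C sed "s/${esc}\[[0-9;]* *[@-o]//g"
}

# Hypothetical sequence: ESC [ 2, a space, then the final character P.
printf 'x\033[2 Py\n' | with_set
printf 'x\033[2 Py\n' | with_star
```

The first command leaves the P behind because the set was satisfied by the space alone; the second removes the whole sequence.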
Now, \^[\[[0-9;]* *[@-o] becomes the full expression. Combining it
with the sed command line that eliminates the underlining
and boldfacing, we would have:
sed -e 's/.\^H//g' \
-e 's/\^[\[[0-9;]* *[@-o]//g'
which would receive data piped into it from the man
command. (Multiple expressions are needed since two
separate searching
operations are to be applied to every single line of
input.)
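A self-contained version of the two-expression command might look like the sketch below. The control characters are generated with printf rather than typed, the name clean_man is made up, the sample line only imitates man output, and LC_ALL=C is an added assumption so that @-o is treated as an ASCII range.

```shell
# Literal backspace and escape characters.
bs=$(printf '\b')
esc=$(printf '\033')

# First expression: drop any character followed by a backspace.
# Second expression: drop ESC [ params, optional spaces, final char.
clean_man() {
    LC_ALL=C sed -e "s/.${bs}//g" \
        -e "s/${esc}\[[0-9;]* *[@-o]//g"
}

# Hypothetical sample mixing escape sequences with overstriking.
printf '\033[1mN\bNAME\033[0m _\bl_\bs\n' | clean_man
```

Both expressions run against every input line, in the order given, so the result is the plain text with neither overstrikes nor escape sequences.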
There is another possible problem: due to an error in
ANSI/ISO code
generation, if more than one space appeared before the
appropriate
action character, this expression would accept that
as legitimate
and act on all those spaces. However, since the intention
here was
not to handle escape code syntax checking issues, this
regular expression
will probably suffice. The ? ("zero or one")
metacharacter,
available in awk and egrep, could handle this problem
by limiting
acceptable values to either zero or one matching space,
but no more.
Although the sed program does not recognize that particular
metacharacter, it does acknowledge the range metacharacters,
which
can be used to duplicate this functionality. By adding
\{0,1\},
you can specify "a space occurring between 0 and
1 times."
So, the final sed command is:
sed -e 's/.\^H//g' \
-e 's/\^[\[[0-9;]* \{0,1\}[@-o]//g'
which translates as, "first expression: substitute,
find a string composed of any character followed by
the backspace,
replace it with nothing, globally," and "second
expression:
substitute, find a string composed of an escape character
followed
by a bracket followed by any of the digits or a semicolon,
zero or
more times, followed by a space, occurring between 0
and 1 times,
followed by any of the characters between @ and o inclusive,
replace
it with nothing, globally."
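The final command can be exercised the same way. This is a minimal sketch, assuming a sed that supports the \{m,n\} interval; the name clean_man_strict and both sample lines are hypothetical, and LC_ALL=C is again an added assumption about how the range @-o should be read.

```shell
# Literal backspace and escape characters.
bs=$(printf '\b')
esc=$(printf '\033')

# Same two substitutions, but the space may now occur at most once.
clean_man_strict() {
    LC_ALL=C sed -e "s/.${bs}//g" \
        -e "s/${esc}\[[0-9;]* \{0,1\}[@-o]//g"
}

# One space before the final character: the sequence is removed.
printf '\033[1mB\bBOLD\033[0m and \033[2 Pdone\n' | clean_man_strict

# Two spaces before the final character: the sequence no longer
# matches and is passed through untouched (cat -tve shows ESC as ^[).
printf 'x\033[2  Py\n' | clean_man_strict | cat -tve
```

The second sample is the malformed case discussed above: the interval refuses the extra space, so the sequence stays visible instead of being silently accepted.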
The use of regular expression metacharacters is similar
to programming
a pattern-matching-oriented little language. By examining
each of
the regular expression metacharacters individually,
rather than trying
to interpret the entire collection of cryptic symbols,
you can find
and manipulate just about any pattern of characters.
Combining regular
expressions with the common UNIX utilities enhances
the functionality
of those utilities. In addition, making the expressions
available
in various scripts that you or your users can work with
will make
many jobs simpler -- while relieving you of the need
to write new
tools.
About the Author
Larry Reznick has been programming professionally since
1978. He is
currently working on systems programming in UNIX and
DOS. He teaches
C language courses at American River College in Sacramento
and is the
owner of Rezolution Technical Books. He can be reached
via email at:
rezbook!reznick@csusac.ecs.csus.edu.