Sidebar: What Uses Regular Expressions?
Regular expressions are a way of representing string
patterns for
searching in the UNIX editors, search programs, and
the awk programming
language. They provide a way of representing a general
string pattern
that could match either a specific fixed string, such
as someone's
name, "Larry", or any of several possible strings,
such as
"[Ll]arry", which matches the name whether or not
it is capitalized.
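As a minimal sketch of the difference between a fixed string and a
general pattern, the following uses Python's re module. Python is not
one of the UNIX tools discussed here, but it accepts the same bracket
syntax. The sample names are made up for illustration.

```python
import re

# Illustrative inputs (not from the original text).
samples = ["Larry", "larry", "Barry"]

fixed = re.compile(r"Larry")      # matches only the capitalized form
either = re.compile(r"[Ll]arry")  # matches "Larry" or "larry"

fixed_hits = [s for s in samples if fixed.search(s)]
either_hits = [s for s in samples if either.search(s)]

print(fixed_hits)   # ['Larry']
print(either_hits)  # ['Larry', 'larry']
```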
The editors ed, ex, and vi; the stream editor,
sed; the search programs grep and egrep (but
not fgrep); and the awk language all use regular expressions
in their searching operations. Other programs not included
with UNIX
but frequently found on UNIX systems, such as emacs
and Perl, also
use regular expressions. Furthermore, even programs
that you write
can use regular expressions, since the regcmp(1) program
will
precompile a regular expression into C source code that
can be
included in a program, and the regcmp(3G) function call
will allow the use
of regular expressions from within another C program.
Certain characters function as metacharacters -- that
is, as characters
that can be used either as commands or as literals in
a fixed string.
However, if a metacharacter is to function as an ordinary
character,
it's necessary to insert a symbol to prevent its being
interpreted
as a command. This is rarely a serious problem, since
the metacharacters
are punctuation characters that do not usually appear
with the alphanumeric
characters commonly used for searching.
The metacharacter most likely to be needed as a fixed
character is
the period (.), which might appear as a decimal point
in the
midst of a number you are searching for. If you specify
"23.45",
for example, the period as a metacharacter stands
for any single character,
so you have actually asked for "23" followed
by any character
followed by "45". Since an actual period can
be the "any
character", this does find "23.45", but
the search
also matches "23945"! To force
the period
to be simply a period, put a backslash (\) in front
of it:
"23\.45". This is known as "escaping"
the metacharacter.
But if the backslash is used as an escape metacharacter,
how do you
specify the backslash? Use "\\", which makes
the backslash
metacharacter a regular backslash.
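The escaping behavior described above can be checked with a short
sketch. It assumes Python's re module, which treats the period and
the backslash the same way the UNIX tools do. The test strings are
illustrative only.

```python
import re

unescaped = re.compile(r"23.45")   # "." matches any single character
escaped = re.compile(r"23\.45")    # "\." matches only a literal period

candidates = ("23.45", "23945")
unescaped_ok = [bool(unescaped.search(s)) for s in candidates]
escaped_ok = [bool(escaped.search(s)) for s in candidates]

print(unescaped_ok)  # [True, True]
print(escaped_ok)    # [True, False]

# A literal backslash is written as two backslashes in the pattern.
backslash = re.compile(r"\\")
has_backslash = bool(backslash.search("a\\b"))   # the string a\b
no_backslash = bool(backslash.search("a/b"))
```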
The vi editor is a screen-oriented front end to the
ex
editor. When you use vi's slash command (/) to search,
most of the regular expression metacharacters will work,
but some
do not. However, a few more of the metacharacters work
with ex.
Recall that any command in vi beginning with a colon
(:)
is really an ex command: thus, ":s" is the
ex substitute
command, which searches for a regular expression and
replaces each match with
a replacement string. So, all of the metacharacters
that do not work in vi but do work in ex can
be made to work by using the equivalent ex colon command.
The grep family of programs really consists of three
different
programs that all do searching. The fgrep program works
only
with fixed strings, so regular expressions cannot be
used with it.
(Contrary to popular belief, the f in fgrep
does not stand for "fast", since egrep is
faster. It
stands for "fixed-string".) The grep and egrep
programs do use regular expressions, but each uses a
different set
from the other. egrep uses a larger set, but grep does
use one handy metacharacter set (the \{\} range specifier)
that egrep
does not use. If you need the range specifier, use grep,
not
egrep. Otherwise, egrep is the fastest of the three
greps, allows more metacharacters than grep, and can
handle more complex regular expressions when needed.
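To illustrate what a range specifier does, here is a small sketch in
Python. Note the assumption: Python's re module writes the range as
{2,3} without backslashes, whereas grep as described here would write
\{2,3\}. The pattern and test strings are invented for the example.

```python
import re

# grep form (per the text):  ab\{2,3\}c
# Python form:               ab{2,3}c
# Meaning: "a", then two or three "b" characters, then "c".
pattern = re.compile(r"ab{2,3}c")

tests = ["abc", "abbc", "abbbc", "abbbbc"]
results = {s: bool(pattern.fullmatch(s)) for s in tests}

for s, matched in results.items():
    print(s, matched)
```

Only the strings with exactly two or three b's match.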
The sed program uses the same regular expressions as
grep
(but not the same ones as egrep). The grep programs
do not perform replacements, but the sed program does,
so a
few metacharacters used for that are added to sed's
list. For
instance, the \(\) set allows specification of a subexpression
that can be referred to later in the search expression
or in the replace
expression. A reference to the first subexpression is
done with the
\1 metacharacter set, to the second with the \2, and
so on for up to \9. The interesting advantage is that
the \# (where # is a digit)
refers to the actual characters matched, which might
be unknown until
the time they are matched. Thus, "\(ab.de\)"
when referred
to by \1 might contain any character between the "b"
and
"d", and whatever it turns out to be will
be used by the \1.
The search-and-replace use of regular expressions is
essential to
the successful use of sed.
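The subexpression mechanism can be sketched in Python, with one
difference to keep in mind: Python groups with plain parentheses,
( ), where sed uses \( \). The \1 reference works the same way in
both. The input text and the angle-bracket replacement are
illustrative, not taken from sed documentation.

```python
import re

text = "abcde abXde"

# Roughly the sed command  s/\(ab.de\)/<\1>/g
# \1 in the replacement stands for whatever the group actually matched.
wrapped = re.sub(r"(ab.de)", r"<\1>", text)
print(wrapped)  # <abcde> <abXde>

# \1 inside the search expression: the same five characters must
# appear a second time, whatever the middle character turned out to be.
doubled = re.compile(r"(ab.de) \1")
same_twice = bool(doubled.search("abcde abcde"))
different = bool(doubled.search("abcde abXde"))
```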
Finally, awk uses the same set of regular expression
metacharacters
as egrep in the pattern part of its pattern
{ action } syntax. All of these very powerful programs
would have
very limited usefulness without
regular expressions.
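The pattern { action } idea can be imitated in a few lines of
Python: for each input line, run the action only if the pattern
matches. This is a rough sketch, not how awk is implemented; the
helper name run and the sample lines are hypothetical.

```python
import re

def run(pattern, action, lines):
    """Apply action to every line that matches pattern, awk-style."""
    regex = re.compile(pattern)
    return [action(line) for line in lines if regex.search(line)]

# Illustrative input lines (not from the original text).
lines = ["Larry 42", "Moe 17", "larry 8"]

# Comparable in spirit to:  awk '/[Ll]arry/ { print $1 }'
first_fields = run(r"[Ll]arry", lambda line: line.split()[0], lines)
print(first_fields)  # ['Larry', 'larry']
```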
Regular expressions provide the ability to search for
sets of possible
strings, the exact contents of which may be unknown,
but for which
the format is specifiable.