Cover V01, I03

sep92.tar


Sidebar: What Uses Regular Expressions?

Regular expressions are a way of representing string patterns for searching in the UNIX editors, search programs, and the awk programming language. They provide a way of representing a general string pattern that could match either a specific fixed string, such as someone's name, "Larry," or any number of possible strings, such as "[Ll]arry," which matches whether the string is capitalized or not.
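As a rough illustration of the "[Ll]arry" pattern, here is a minimal sketch using Python's re module rather than one of the UNIX tools named in this sidebar. The sample lines are invented for the example.

```python
import re

# "[Ll]arry" accepts either an upper- or lowercase first letter,
# followed by the fixed characters "arry".
pattern = re.compile(r"[Ll]arry")

# Hypothetical input lines, made up for this illustration.
lines = ["Larry called.", "ask larry later", "LARRY is out", "Harry called."]

# Keep only the lines that contain a match, much as grep would.
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Only the first two lines are kept: "LARRY" fails because the bracket expression covers just the first letter, and "Harry" fails because "H" is not in the set.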

The editors ed, ex, and vi; the stream editor, sed; the search programs grep and egrep (but not fgrep); and the awk language all use regular expressions in their searching operations. Other programs not included with UNIX but frequently found on UNIX systems, such as emacs and Perl, also use regular expressions. Furthermore, even programs that you write can use regular expressions: the regcmp(1) program will precompile a regular expression into C source code that can be included in your own program, and the regcmp(3G) function call allows the use of regular expressions from within another C program.

Certain characters function as metacharacters -- that is, as characters that act as commands to the pattern matcher instead of standing for themselves. If a metacharacter is to function as an ordinary character, you must precede it with an escape symbol to prevent its being interpreted as a command. This is rarely a serious problem, since the metacharacters are punctuation characters that do not usually appear with the alphanumeric characters commonly used for searching.

The metacharacter most likely to be needed as a fixed character is the period (.), which might appear as a decimal point in the midst of a number you are searching for. If you specified "23.45", for example, since the period as a metacharacter stands for any character, you have actually asked for "23" followed by any character followed by "45". Since an actual period could be the "any character," this might work just as written, but if the search finds "23945," that would match, too! To force the period to be simply a period, put a backslash (\) in front of it, "23\.45". This is known as "escaping" the metacharacter. But if the backslash is used as an escape metacharacter, how do you specify the backslash? Use "\\", which makes the backslash metacharacter a regular backslash.
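The effect of escaping the period can be checked with a short sketch. This uses Python's re module, which treats the period and the backslash the same way as the UNIX tools described here. The candidate strings are made up for the example.

```python
import re

# Hypothetical candidate strings.
candidates = ["23.45", "23945", "23x45", "2345"]

# Unescaped: the period stands for any single character.
unescaped = [s for s in candidates if re.search(r"23.45", s)]

# Escaped: the backslash forces the period to be a literal period.
escaped = [s for s in candidates if re.search(r"23\.45", s)]

# A doubled backslash in the pattern matches one literal backslash.
has_backslash = re.search(r"\\", "dir\\file") is not None

print(unescaped)
print(escaped)
print(has_backslash)
```

The unescaped pattern accepts "23.45", "23945", and "23x45", but not "2345", since some character must sit between the "23" and the "45". The escaped pattern accepts only "23.45".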

The vi editor is a screen-oriented front end to the ex editor. When you use vi's slash command (/) to search, most of the regular expression metacharacters work, but some do not; a few more of them work with ex. Recall that any command in vi beginning with a colon (:) is really an ex command: thus, ":s" is the ex substitute command, which searches for a regular expression and replaces whatever it matches with a replacement string. So, all of the regular expressions that do not work in vi but do work in ex can be made to work by using the equivalent ex colon command.

The grep family really consists of three different programs that all do searching. The fgrep program works only with fixed strings, so regular expressions cannot be used with it. (Contrary to popular belief, the f in fgrep does not stand for "fast", since egrep is faster. It stands for "fixed-string.") The grep and egrep programs do use regular expressions, but each recognizes a different set of metacharacters. egrep uses a larger set, but grep does use one handy metacharacter set (the \{\} range specifier) that egrep does not use. If you need the range specifier, use grep, not egrep. Otherwise, egrep is the fastest of the three greps, allows more metacharacters than grep, and can handle more complex regular expressions when needed.
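The difference between a fixed-string search and a regular expression search, and the idea behind the range specifier, can be sketched in Python. This is only an analogy: Python's re module is not grep, and it writes the range specifier as {m,n} where grep writes \{m,n\}. The input lines are invented for the example.

```python
import re

# Hypothetical input lines.
lines = ["cost: 1.50", "cost: 1250", "a.b", "axb"]

# fgrep-style search: "a.b" is taken literally, period and all.
fixed = [l for l in lines if "a.b" in l]

# grep-style search: the period matches any single character.
regex = [l for l in lines if re.search(r"a.b", l)]

# Range specifier: three or four digits in a row.
# grep would write this as [0-9]\{3,4\}; Python drops the backslashes.
ranged = [l for l in lines if re.search(r"[0-9]{3,4}", l)]

print(fixed)
print(regex)
print(ranged)
```

The fixed-string search finds only "a.b", the regular expression search also finds "axb", and the range search finds only the line containing "1250".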

The sed program uses the same regular expressions as grep (but not the same ones as egrep). The grep programs do not perform replacements, but the sed program does, so a few metacharacters used for that are added to sed's list. For instance, the \(\) set allows specification of a subexpression that can be referred to later in the search expression or in the replace expression. A reference to the first subexpression is done with the \1 metacharacter set, to the second with the \2, and so on for up to \9. The interesting advantage is that the \# (where # is the digit) refers to the actual characters matched, which might be unknown until the time they are matched. Thus, "\(ab.de\)" when referred to by \1 might contain any character between the "b" and the "d," and whatever it turns out to be will be used by the \1. The search and replace use of regular expressions is essential to the successful use of sed.
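Here is a small sketch of the subexpression idea, again in Python, not sed. Python writes the grouping as plain ( ) where sed writes \( \), but \1 means the same thing in both. The sample text and the angle-bracket replacement are invented for the example. A roughly equivalent sed command would be s/\(ab.de\)/<\1>/g.

```python
import re

# Hypothetical input text containing two different matches of "ab.de".
text = "xxabcdexx abZdeyy"

# In the replacement, \1 stands for whatever the group actually matched.
wrapped = re.sub(r"(ab.de)", r"<\1>", text)
print(wrapped)

# \1 can also be used later in the search expression itself:
# it must repeat exactly the characters the group matched.
same_twice = re.search(r"(ab.de) \1", "abcde abcde") is not None
different = re.search(r"(ab.de) \1", "abcde abXde") is not None
print(same_twice, different)
```

The substitution wraps each match in angle brackets while preserving the unknown middle character, giving "xx<abcde>xx <abZde>yy". The second search succeeds only when the repeated text is identical to what the group matched.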

Finally, awk uses the same set of regular expression metacharacters as egrep in the pattern part of its pattern { action } syntax. All of these very powerful programs would have very limited usefulness without regular expressions. Regular expressions provide the ability to search for sets of possible strings, the exact contents of which may be unknown, but for which the format is specifiable.
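The pattern { action } idea can be imitated in a few lines of Python: test each input line against a pattern, and run an action only on the lines that match. The records below are invented, and the alternation with | is one of the egrep-style metacharacters. An awk program along the lines of awk -F: '/^(root|daemon):/ { print $1 }' would do a similar job.

```python
import re

# Hypothetical colon-separated records.
records = ["root:x:0:0", "larry:x:101:10", "daemon:x:1:1"]

# Pattern: lines whose first field is "root" or "daemon".
pattern = re.compile(r"^(root|daemon):")

# Action: for each matching line, take the first field.
selected = [line.split(":")[0] for line in records if pattern.search(line)]
print(selected)
```

Only the "root" and "daemon" records satisfy the pattern, so only their first fields are collected.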