Cover V04, I04

Creating terminfo Source from Binaries

Larry Reznick

A UNIX system I recently started using didn't have the binary terminfo entry I needed for my remote login. Looking further, I found that that system didn't have a source file for the installed terminfo entries. A missing terminfo source file wasn't unusual for systems I've used and administered -- some have them, some don't. But I needed an entry or I was going to get stuck with a dumber terminal emulation.

Another system I use had the entry I needed, but didn't have a terminfo source file either. I could run infocmp(1M), capture the entry I needed, bring it over to the new system, and install it, but I've learned from experience that each system will inevitably be missing entries that the other system has.

Rather than solve only my immediate problem, I decided to solve the more general problem of missing terminfo source code. To accomplish this, I created a script that turns every compiled terminfo entry into its equivalent source. I group the sources into a single source file called terminfo.src, bring that file from one system to the other, then run the same script there. The resulting file combines the source for entries from both systems, creating a master terminfo source file. Any time I find or create a new entry, I add its source to this master. I run the script again on other systems to extract sources from their binaries and add them to the master source file automatically. buildinfo (Listing 1) is the terminfo source creation script.

Building terminfo.src

infocmp creates source code from an existing terminfo binary file. If you don't give infocmp the name of an entry, it takes the name in the TERM variable and prints that name's source. Given one name, it shows the source for that named entry if there is one. Given two names, it shows the differences between the two entries. Because of this two-name format, I couldn't simply give infocmp a long list of names in one command line. infocmp defaults to its -I option when fewer than two names are given and to -d, which compares the entries, when two are given. Using an explicit -I option, I could give several names at once, but then I'd have no reliable way of knowing whether an entry had already appeared.

terminfo allows terminal name aliases by separating aliases with a vertical bar (|) in the first line of the definition. When tic(1M), the terminfo compiler, creates a binary from the source definition, it creates one binary file and then hard-links each alias to the binary file. buildinfo has to know when a terminal entry already exists in the source file so that it won't duplicate the entry.

Searching through the terminfo.src file for duplicate entries would take a long time. Each entry defines the terminal's characteristics, so the source file containing all these entries is huge. To shorten the search, buildinfo creates a separate file composed only of the single lines holding the aliases. Such a list file is significantly smaller than the master file. In one example, 740 terminfo entries created a combined source file 664,976 bytes long, while the alias list file was only 34,478 bytes. Given a file at five percent of the source file's size, the search potentially runs 20 times faster than it might run through the master file, barring search optimizations.
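
The arithmetic behind that estimate can be checked with a short sketch. The function name size_ratio is made up for illustration; the byte counts are the two quoted above.

```shell
# size_ratio LIST_BYTES SOURCE_BYTES
# Prints the list file's size as a percentage of the source file's size,
# then how many times larger the source file is than the list file.
size_ratio() {
    LC_ALL=C awk -v list="$1" -v src="$2" \
        'BEGIN { printf "%.1f %.1f\n", 100 * list / src, src / list }'
}

size_ratio 34478 664976
```

For these figures the sketch prints 5.2 and 19.3, so "five percent" and "20 times" are rounded values.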

For the alias list file to work, I had to add an entry to the alias list for each new alias I found. Any terminfo name -- even the very next one evaluated -- could be an alias for an entry already found. buildinfo couldn't use infocmp's -I option to build a list of names because that generates all requested entries despite duplicate aliases.

I considered building a name list in a variable. I could add each new name to the variable by using

ADDLIST="$ADDLIST $NEWNAME"

After ADDLIST had been built, I could execute

infocmp -I $ADDLIST >>terminfo.src

Unfortunately, the full solution isn't this simple. I can't check on whether a new name already exists in the list file unless the previous aliases are present. Putting new aliases in the list requires running infocmp an extra time to get the alias line. Running infocmp twice for each name slows down the program too much.

Instead, I could run infocmp once, then put each new entry into terminfo.src and in the alias file as soon as buildinfo knows the entry belongs in the source file. But first, I had to resolve a peculiar searching problem.

The Great Alias Search

buildinfo's primary idea is simple: for each filename in all the /usr/share/lib/terminfo/? directories, stored in the MASTERDIR variable, see if the name has already appeared in the alias list. Use egrep, the fastest of the grep family, to search the alias list. If the name appears, skip it. If the name is missing, use infocmp to add it to the source file and to the alias list. The real problem comes in searching through the alias list. Aliases frequently contain two characters special to egrep: plus signs (+) and vertical bars (|).
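
As a rough sketch of the directory walk only, and not the code from Listing 1, the following function prints every entry name found in the one-character subdirectories of a terminfo tree. The function name list_entry_names is hypothetical.

```shell
# list_entry_names DIR
# Prints the name of every file found in DIR's one-character
# subdirectories, one name per line.
list_entry_names() {
    for path in "$1"/?/*; do
        # Skip the unexpanded pattern left behind by an empty directory.
        [ -f "$path" ] || continue
        basename "$path"
    done
}
```

Assuming MASTERDIR is set as described above, list_entry_names "$MASTERDIR" would produce the names that the main loop checks against the alias list.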

Alias names sometimes contain plus signs to identify special variations that add features to another entry. Here are several alias lines containing plus signs:

adm3a+|3a+|adm3aplus|lsi adm3a+,
ansi+arrows,
ansi+cup,
ansi+erase,

Notice that the first entry, using a plus to identify the terminal's brand name, spells out the word "plus" in one alias, but uses plus signs in two other aliases. egrep uses the plus sign as a regular expression metacharacter that means search for one-or-more appearances of the previous expression or character. buildinfo must escape the plus signs so that egrep can find them as simple characters.

buildinfo fixes the plus signs using the following line:

T=`echo "$TIFILE" | sed 's/+/\\\\+/g'`

TIFILE contains the current terminfo file's name, which might have a plus. Because echo pipes that name to sed, sed can operate on the alias name stored in the variable. The idea of sed's expression is to substitute an escaped plus (\+) for each plus (+). This substitution is applied globally through the filename in case more than one plus appears. To emit a single backslash in front of the plus, sed needs to see a doubled backslash (\\+). You can see this yourself with the following Bourne shell command lines:

tf="a+b"; echo $tf
echo $tf | sed 's/+/\\+/g'
unset tf

However, in buildinfo the sed command sits inside back-apostrophes, which assign sed's output to the temporary T variable. The shell scans the text between back-apostrophes before it runs the command, and during that scan it reduces each doubled backslash to a single backslash, even inside apostrophes. With only two backslashes, sed receives \+ as its replacement, which it treats as the same plus it started with. No progress.

So, Bourne shell needs to see doubled backslashes to pass along one backslash, just as sed did. That means the command line must have quadruple backslashes. The shell turns the four backslashes into two for sed, and sed turns those two into one, outputting \+ for every input + instance. T ends up holding one backslash and plus where originally only a plus appeared.
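
The two quoting levels can be compared side by side in a short sketch. The function name escape_plus is hypothetical, and printf is used instead of echo only to keep the backslash from being interpreted on output.

```shell
# escape_plus NAME
# Prints NAME with a backslash inserted before every plus sign.
escape_plus() {
    # This sed script is not written inside back-apostrophes, so it
    # reaches sed unchanged and two backslashes are enough.
    printf '%s\n' "$1" | sed 's/+/\\+/g'
}

# Inside back-apostrophes, four backslashes give the same result.
TIFILE="adm3a+"
T=`echo "$TIFILE" | sed 's/+/\\\\+/g'`

printf '%s\n' "$T"
escape_plus "$TIFILE"
```

Both printf lines should show the name with one backslash in front of the plus.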

The vertical bar separating aliases is another special egrep regular expression character. An alias could be the first in the list, so it wouldn't have a leading bar. The last alias would have a leading bar but wouldn't have a trailing bar. An alias in the middle would have both a leading and a trailing bar. However, not every terminfo entry has aliases. A single name has no bars unless a description follows. Without a description, the name terminates with a comma. Here are several example alias lines:

ansi|generic ansi standard terminal,
guru-24,
ibmpc|ibm-pc|ibm5051|5051|IBM Personal Computer,
vt100|vt100-am|dec vt100 (w/advanced video),

The guru-24 is a 24-line Ann Arbor terminal. There are several other definitions for variations on this terminal, so this one has no other alias. Otherwise, vertical bars separate multiple match strings for egrep's search. buildinfo must escape the vertical bar characters with backslashes for egrep to find them.

This looks like a case for egrep's ? regular expression metacharacter, which matches zero or one of the previous expression or character. Thus, given the T variable from the sed expression, the following egrep expression looks like it should work:

\|?${T}\|?

This says to look for zero-or-one vertical bar followed by the T variable's value, followed by another zero-or-one vertical bar. That matches a $T with a leading bar only, a trailing bar only, or both. Unfortunately, it also matches neither because both bars could appear zero times. Such a both-zero case might match a part of the description which terminfo explicitly excludes from the aliases when present. It turns out that when at least one bar is present, anything following the last bar is ignored. When bars are absent, descriptions aren't allowed -- only the single terminfo name exists. So, no need for the question mark after the second, escaped vertical bar.
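
A small experiment shows the difference between the two forms. This is a sketch, not part of buildinfo: the helper name matches is made up, and grep -E stands in for egrep, which accepts the same extended regular expressions.

```shell
# matches PATTERN LINE
# Succeeds when the extended regular expression PATTERN matches LINE.
matches() {
    printf '%s\n' "$2" | grep -E -q -- "$1"
}

LINE='ansi|generic ansi standard terminal,'

# Both bars optional: the word "generic" in the description matches.
matches '\|?generic\|?' "$LINE" && echo "optional bars: false match"

# Trailing bar required: "generic" no longer matches, but "ansi" does.
matches '\|?generic\|' "$LINE" || echo "required bar: description ignored"
matches '\|?ansi\|' "$LINE" && echo "required bar: alias found"
```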

buildinfo constructs the finished match string to include the alternative case for no aliases. That requires ${T}, followed by a comma, to be the entire contents of the description line. Finally, egrep's multiple expression searching facility uses an unescaped vertical bar to separate the two expressions. MATCHSTR gets the resulting OR'd expressions, so I plug MATCHSTR into egrep, quoting it to be sure that no remaining special characters confuse the shell.
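
One way to read that description is sketched below. It is not copied from Listing 1. The function name alias_known is hypothetical, grep -E stands in for egrep, and the ^ and $ anchors in the second alternative are an assumption based on the requirement that the name and comma be the entire line.

```shell
# alias_known NAME LISTFILE
# Succeeds when NAME already appears as an alias in LISTFILE.
alias_known() {
    # Escape plus signs so grep -E treats them as ordinary characters.
    T=$(printf '%s\n' "$1" | sed 's/+/\\+/g')
    # First alternative: NAME followed by a bar (entries with aliases).
    # Second alternative: NAME and a comma as the whole line (no aliases).
    MATCHSTR="\|?${T}\||^${T},\$"
    grep -E -q -- "$MATCHSTR" "$2"
}
```

Given an alias list holding the example lines above, alias_known ibm-pc and alias_known guru-24 should both succeed, while a word taken from a description, such as generic, should not.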

Building Two Files Simultaneously

Adding to the alias list file dynamically became logically inseparable from infocmp's building of the terminfo.src file. Every new name and all its aliases had to go into the alias list as soon as an alias was discovered missing because the next name could be another alias that didn't deserve entry into the source file. The only way to get the alias list was to run infocmp. Two runs of infocmp -- one to get the name's aliases and another later to output the name's source -- would execute much too slowly.

Since I was running infocmp inside the one-file-at-a-time loop anyway, there was only so much I could do to make it run faster. Solving the simultaneous file updating problem requires understanding that the alias information coming from infocmp is a small piece of the total infocmp data going to the master source file. If I could send the entire infocmp output to the source file and still pipe it to another egrep to extract aliases, I'd have the problem solved. Redirecting piped data to a file and to the standard output is exactly what tee does.

tee(1) is one of the simplest utilities provided with UNIX. It's so simple, I show a version of it to beginning C programming students when introducing I/O redirection programming. Many people wonder who would bother using this apparently trivial utility.

tee duplicates piped data: everything it reads goes both to a file and to its standard output. buildinfo pipes infocmp's output to tee, using tee's -a option to append to the output file instead of overwriting it. The standard output copy continues through the pipe to egrep, which uses a regular expression of the form '^[^ #<tab>]', where <tab> stands for a literal tab character, to extract all lines not beginning with a space, pound sign comment, or tab. That expression extracts only alias description lines, which are appended to the alias list file. You may not need tee often, but when you do, you'll be very thankful for it.
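
The tee step can be sketched as a small function. The name record_entry is hypothetical, grep -E stands in for egrep, and the tab is built with printf so that it stays visible in the source.

```shell
# record_entry SRCFILE LISTFILE
# Reads one entry's source text on standard input, appends all of it
# to SRCFILE, and appends only its name line to LISTFILE.
record_entry() {
    TAB=$(printf '\t')
    tee -a "$1" | grep -E "^[^ #$TAB]" >> "$2"
}
```

A call might look like infocmp -I "$TIFILE" | record_entry terminfo.src alias.list, where alias.list is an assumed name for the alias list file.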

Gotcha!

buildinfo, as shown in Listing 1, works well, although the once-per-file loop does run slowly, as expected. On a Sun4 using Solaris 5.4, it averaged 0.83 seconds per entry, or about 11 minutes in all, to build a master source file of 798 entries. I'm sure I could get it running faster in C with more intelligent data structures, but I won't likely run this script often enough to justify conversion. Once the source file is built, I expect to add terminfo source to it, not precompiled entries. I expect to rerun buildinfo on the same source file only when combining all entries for several systems. Still, infocmp didn't want to make this script too easy.

buildinfo surprised me on one system I tested. Notice that buildinfo runs through the terminfo directories starting with those named for lowercase letters. Most entries have an alias beginning with a lowercase letter, so by running through those directories first, you can pretty well ensure that the entries in all other directories are already in the alias list. Any entries added after the z directory finishes, such as those beginning with a digit or uppercase letter, must not have been entered already. Few entries follow z. I was surprised by one system that added "ANSI" last. buildinfo had added "ansi" much earlier in the run.

Checking, I found that the ansi and ANSI terminfo entries were identical. The ansi definition line showed no ANSI alias, so buildinfo correctly added it. How could there be an ANSI filename without an ANSI alias, with otherwise identical entries? Further examination showed only one link for the ansi file, but three for ANSI. Using find's -inum option to search for ANSI's inode in other terminfo directories, I discovered that the other two links were not in /usr/share/lib/terminfo.
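
A search of this kind can be sketched as follows. The function name links_of is hypothetical, and -xdev is added here on the assumption that the search should stay on one filesystem, since hard links cannot cross filesystems.

```shell
# links_of FILE ROOT
# Prints every path under ROOT that shares FILE's inode, staying on
# one filesystem because inode numbers are unique only within one.
links_of() {
    inode=$(ls -i "$1" | awk '{ print $1 }')
    find "$2" -xdev -inum "$inode" -print
}
```

Comparing the number of paths printed with the link count that ls -l reports shows whether any links live outside the directory tree being searched.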

Using ncheck(1M) to search throughout the current filesystem, I found the other two ANSI links in a hidden directory tree containing minimal system terminal definitions. Apparently some installation procedure had forced those entries in under the names they had in the hidden directory, even though the name ANSI wasn't an alias in the entry's own definition.

buildinfo, seeing such a mismatched name, figures that that name requires an entry. It adds an entry without discovering that the entry already exists in the master source file. buildinfo wasn't designed to anticipate such silliness. Armed with this information, I searched through the terminfo.src file to find how many duplicates had been caused by this problem. The following pipeline finds such duplicates:

egrep '^[^ #	]' terminfo.src |
sort |
uniq -c |
egrep -v '^ *1 '

The first egrep searches for lines beginning with anything other than a space, pound, or tab. Only description lines containing aliases show up. Sorting those description lines puts the duplicates together, so uniq can find them. uniq's -c option shows the count of repeats in the sorted file, even if the line only appears once. A final egrep eliminates all lines beginning with a one count, which may be preceded by spaces but is always followed by a space. That leaves any larger numbers beginning with 1, such as 10. While I doubt the problem creates that many duplicates, I saw no need to take chances.
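
The same pipeline can be wrapped in a function, which makes it easy to confirm that a count such as 10 survives the final filter. This is a sketch: the name dup_headers is hypothetical, grep -E stands in for egrep, and the tab is built with printf so that it stays visible.

```shell
# dup_headers SRCFILE
# Prints each name line that occurs more than once in SRCFILE,
# prefixed by the number of times it occurs.
dup_headers() {
    TAB=$(printf '\t')
    grep -E "^[^ #$TAB]" "$1" |
    sort |
    uniq -c |
    grep -E -v '^ *1 '
}
```

Running dup_headers terminfo.src after a build lists only the entries that need attention. No output means no duplicates were found.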

My sample output looked like this:

2 ansi|generic ansi standard terminal,
2 hp|hewlett-packar|hewlettpackard,
2 wy99gt-w-vb|wy99gt-wvb|wyse99gt-wvb

Not as bad as I feared -- only three duplicates. To eliminate them, I edited the terminfo.src file, searched for each one, and eliminated all those after the first. I then had to decide whether to eliminate the extra files in MASTERDIR. If you eliminate the extra files, you may want to add the extra names as aliases to the one entry you keep. Then, either manually create the now-missing hard links or rebuild using tic to create the links, in case any other software depends on them.

I could have added code to buildinfo to detect this, but I hesitate to slow the program down further just to handle a few problem files. Instead, I added the duplicate-detection pipeline to the end of the program. That way, the user knows when this problem appears.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX, MS-DOS, and OS/2. He teaches C, C++, and UNIX language courses at American River College and at the University of California, Davis extension.