Creating terminfo Source from Binaries
Larry Reznick
A UNIX system I recently started using didn't have the
binary terminfo
entry I needed for my remote login. Looking further,
I found that
that system didn't have a source file for the installed
terminfo
entries. A missing terminfo source file wasn't unusual
for
systems I've used and administered -- some have them,
some don't.
But I needed an entry or I was going to get stuck with
a dumber terminal
emulation.
Another system I use had the entry I needed, but didn't
have a terminfo
source file either. I could run infocmp(1M), capture
the entry
I needed, bring it over to the new system, and install
it, but I've
learned from experience that each system will inevitably
be missing
entries that the other system has.
Rather than solve my little problem, I decided to solve
the more general
problem of a lack of terminfo source code. To accomplish
this
I created a script that turns every compiled terminfo
entry
into its equivalent source. I group the sources into
a single source
file called terminfo.src, bring the resulting source
file
from one system to the other, then run the same script.
The resulting
source file combines the source for entries from both
systems, creating
a master terminfo source file. Any time I find or create
a
new entry, I add its source to this master. I run the
script again
on other systems to extract sources from their binaries
and add them
to the master source file automatically. buildinfo (Listing 1)
is the terminfo source creation script.
Building terminfo.src
infocmp creates source code from an existing terminfo
binary file. If you don't give infocmp the name of an
entry,
it takes the name in the TERM variable and prints that
name's
source. Otherwise, given only one name, it shows the
source for that
named entry if there is one. Given two names, it shows
the differences
between the two entries. Because of this two-name format,
I couldn't
simply give infocmp a long list of names in one command
line.
infocmp uses its -I option when fewer than two arguments
are given. When two arguments come, it defaults to -d
to compare
the two. Using an explicit -I option, I could give several
names at once, but then I'd have no reliable way of
knowing that an
entry has already appeared.
terminfo allows terminal name aliases by separating
aliases
with a vertical bar (|) in the first line of the definition.
When
tic(1M), the terminfo compiler, creates a binary from
the source definition, it creates one binary file and
then hard-links
each alias to the binary file. buildinfo has to know
when
a terminal entry already exists in the source file so
that it won't
duplicate the entry.
Searching through the terminfo.src file for duplicate
entries
would take a long time. Each entry defines the terminal's
characteristics,
so the source file containing all these entries is huge.
To shorten
the search, buildinfo creates a separate file composed
only
of the single lines holding the aliases. Such a list
file is significantly
smaller than the master file. In one example, 740 terminfo
entries created a combined source file 664,976 bytes
long, while the
alias list file was only 34,478 bytes. Given a file
at five percent
of the source file's size, the search potentially runs
20 times faster
than it might run through the master file, barring search
optimizations.
For the alias list file to work, I had to add an entry
to the alias
list for each new alias I found. Any terminfo name --
even
the very next one evaluated -- could be an alias for
an entry already
found. buildinfo couldn't use infocmp's -I
option to build a list of names because that generates
all requested
entries despite duplicate aliases.
I considered building a name list in a variable. I could
add each
new name to the variable by using
ADDLIST="$ADDLIST $NEWNAME"
After ADDLIST had been built, I could execute
infocmp -I $ADDLIST >>terminfo.src
Unfortunately, the full solution isn't this simple.
I
can't check on whether a new name already exists in
the list file
unless the previous aliases are present. Putting new
aliases in the
list requires running infocmp an extra time to get the
alias
line. Running infocmp twice for each name slows down
the program
too much.
Instead, I could run infocmp once, then put each new
entry
into terminfo.src and in the alias file as soon as buildinfo
knows the entry belongs in the source file. But first,
I had to resolve
a peculiar searching problem.
The Great Alias Search
buildinfo's primary idea is simple: for each filename
in all
the /usr/share/lib/terminfo/? directories, stored in
the MASTERDIR
variable, see if the name has already appeared in the
alias list.
Use egrep, the fastest of the grep family, to search
the alias list. If the name appears, skip it. If the
name is missing,
use infocmp to add it to the source file and to the
alias
list. The real problem comes in searching through the
alias list.
Aliases frequently contain two characters special to
egrep:
plus signs (+) and vertical bars (|).
Alias names sometimes contain plus signs to identify
special variations
that add features to another entry. Here are several
alias lines containing
plus signs:
adm3a+|3a+|adm3aplus|lsi adm3a+,
ansi+arrows,
ansi+cup,
ansi+erase,
Notice that the first entry, using a plus to identify
the terminal's brand name, spells out the word "plus"
in one
alias, but uses plus signs in two other aliases. egrep
uses
the plus sign as a regular expression metacharacter
that means search
for one-or-more appearances of the previous expression
or character.
buildinfo must escape the plus signs so that egrep
can find them as simple characters.
buildinfo fixes the plus signs using the following line:
T=`echo "$TIFILE" | sed 's/+/\\\\+/g'`
TIFILE contains the current terminfo
file's name, which might have a plus. Using echo to
redirect,
sed can operate on the alias name stored in the variable.
The idea of sed's expression is to substitute a plus
(+) with
an escaped plus (\+). This substitution is applied globally
through
the filename in case more than one appears. To emit
a single backslash
in front of the plus, sed needs to see a doubled backslash
(\\+). You can see this yourself with the following
Bourne shell command
lines:
tf="a+b"; echo $tf
echo $tf | sed 's/+/\\+/g'
unset tf
However, this entire sed replacement is happening
in a subshell because of the back-apostrophes. Back-apostrophes
assign
sed's output to the temporary T variable. Once the
subshell is finished, the current shell completes the
assignment by
analyzing the rest of the command line, now composed
of T=\+
when only two backslashes are used. Bourne shell naturally
removes
the only backslash and the expression is left with the
same plus it
started with. No progress.
So, Bourne shell needs to see doubled backslashes to
emit one backslash,
just as sed did. That means sed must have quadruple
backslashes. sed takes the four backslashes and turns
them into two,
outputting\+ for every input + instance. The shell, now seeing the
two backslashes, delivers one backslash and plus where
originally
only a plus appeared.
The vertical bar separating aliases is another special
egrep regular
expression character. An alias could be the first in
the list, so
it wouldn't have a leading bar. The last alias would
have a leading
bar but wouldn't have a trailing bar. An alias in the
middle would
have both a leading and a trailing bar. However, not
every terminfo
entry has aliases. A single name has no bars unless
a description
follows. Without a description, the name terminates
with a comma.
Here are several example alias lines:
ansi|generic ansi standard terminal,
guru-24,
ibmpc|ibm-pc|ibm5051|5051|IBM Personal Computer,
vt100|vt100-am|dec vt100 (w/advanced video),
The guru-24 is a 24-line Ann Arbor terminal.
There are several other definitions for variations on
this terminal,
so this one has no other alias. Otherwise, vertical
bars separate
multiple match strings for egrep's search. buildinfo
must escape the vertical bar characters with backslashes
for egrep
to find them.
This looks like a case where one could use the egrep
? regular
expression metacharacter, which means only match zero-or-one
of the
previous expression or character. Thus, given the T
variable
from the sed expression, the following egrep expression
must work:
\|?${T}\|?
This says to look for zero-or-one vertical bar followed
by the T variable's value, followed by another zero-or-one
vertical bar. That matches a $T with a leading bar only,
a
trailing bar only, or both. Unfortunately, it also matches
neither
because both bars could appear zero times. Such a both-zero
case might
match a part of the description which terminfo explicitly
excludes from the aliases when present. It turns out
that when at
least one bar is present, anything following the last
bar is ignored.
When bars are absent, descriptions aren't allowed --
only the single
terminfo name exists. So, no need for the question mark
after
the second, escaped vertical bar.
buildinfo constructs the finished match string to include
the alternative case for no aliases. That requires ${T},
followed
by a comma, to be the entire contents of the description
line. Finally,
egrep's multiple expression searching facility uses
an unescaped
vertical bar to separate the two expressions. MATCHSTR
gets
the resulting OR'd expressions, so I plug MATCHSTR
into egrep, quoting it to be sure that no remaining
special
characters confuse the shell.
Building Two Files Simultaneously
Adding to the alias list file dynamically became logically
inseparable
from infocmp's building of the terminfo.src file.
Every new name and all its aliases had to go into the
alias list as
soon as an alias was discovered missing because the
next name could
be another alias that didn't deserve entry into the
source file. The
only way to get the alias list was to run infocmp. Two
runs
of infocmp -- one to get the name's aliases and another
later to output the name's source -- would execute much
too slowly.
Since I was running infocmp inside the one-file-at-a-time
loop anyway, there was only so much I could do to make
it run faster.
Solving the simultaneous file updating problem requires
understanding
that the alias information coming from infocmp is a
small
piece of the total infocmp data going to the master
source
file. If I could send the entire infocmp output to the
source
file and still pipe it to another egrep to extract aliases,
I'd have the problem solved. Redirecting piped data
to a file and
to the standard output is exactly what tee does.
tee(1) is one of the simplest utilities provided with
UNIX.
It's so simple, I show a version of it to beginning
C programming
students when introducing I/O redirection programming.
Many people
wonder who would bother using this apparently trivial
utility.
tee duplicates snapshots of piped data. Pipe infocmp's
output to tee. tee's -a option appends data
instead of overwriting data to the output file. Once
tee finishes,
the standard output duplicate continues through the
pipe to egrep,
which uses the regular expression '^[^ # ]' to extract
all
lines not beginning with a space, pound sign comment,
or tab. That
expression extracts only alias description lines and
appends them
to the alias list file. You may not need tee often,
but when
you do, you'll be very thankful for it.
Gotcha!
buildinfo, as shown in Listing 1, works well, although
the
once-per-file loop does run slowly, as expected. However,
I ran it
on a Sun4 using Solaris 5.4 and it averaged 0.83 seconds
per entry
to build a master source file of 798 entries. I'm sure
I could get
it running faster in C with more intelligent data structures,
but
I won't likely run this script often enough to justify
conversion.
Once the source file is built, I expect to add terminfo
source
to it, not precompiled entries. I expect to rerun buildinfo
on the same source file only when combining all entries
for several
systems. Still, infocmp didn't want to make this script
too
easy.
buildinfo surprised me on one system I tested. Notice
that
buildinfo runs through all the terminfo directories
beginning with lowercase letters. Most entries have
aliases beginning
with a lowercase letter, so by running through those
first, you can
pretty well ensure that all other directories' entries
are already
in the alias list. Any others added after the z directory
finishes,
such as those beginning with a digit or uppercase letter,
must not
have been entered already. Few entries follow z. I was
surprised
by one system that added "ANSI" last. buildinfo
added
"ansi" much earlier in the run.
Checking, I found that the ansi and ANSI terminfo
entries were identical. The ansi definition line showed
no
ANSI alias, so buildinfo correctly added it. How could
there be an ANSI filename without an ANSI alias, with
otherwise identical entries? Further examination showed
only one link
for the ansi file, but three for ANSI. Using find's
-inum option to search for ANSI's inode in other terminfo
directories, I discovered that the other two links were
not in /usr/share/lib/terminfo.
Using ncheck(1M) to search throughout the current filesystem,
I found the other two ANSI links in a hidden directory
tree containing
minimal system terminal definitions. Apparently some
installation
procedure had forced those entries in, using their names
as they appeared
in the hidden directory, even though its own name wasn't
an alias
entry in its own definition.
buildinfo, seeing such a mismatched name, figures that
that
name requires an entry. It adds an entry without discovering
that
the entry already exists in the master source file.
buildinfo
wasn't designed to anticipate such silliness. Armed
with this information,
I searched through the terminfo.src file to find how
many
duplicates had been caused by this problem. The following
pipeline
finds such duplicates:
egrep '^[^ # ]' terminfo.src |
sort |
uniq -c |
egrep -v '^ *1 '
The first egrep searches for lines beginning
with anything other than a space, pound, or tab. Only
description
lines containing aliases show up. Sorting those description
lines
puts the duplicates together, so uniq can find them.
uniq's
-c option shows the count of repeats in the sorted file,
even
if the line only appears once. A final egrep eliminates
all
lines beginning with a one count, which may be preceded
by spaces
but is always followed by a space. That leaves any larger
numbers
beginning with 1, such as 10. While I doubt the problem
creates that
many duplicates, I saw no need to take chances.
My sample output looked like this:
2 ansi|generic ansi standard terminal,
2 hp|hewlett-packar|hewlettpackard,
2 wy99gt-w-vb|wy99gt-wvb|wyse99gt-wvb
Not as bad as I feared -- only three duplicates. To
eliminate them, I edited the terminfo.src file, searched
for
each one, and eliminated all those after the first.
I then had to
decide whether to eliminate the extra files in MASTERDIR.
If you eliminate the extra files, you may want to add
the extra names
as aliases to the one entry you keep. Then, either manually
create
the now-missing hard links or rebuild using tic to create
the links, in case any other software depends on them.
I could have added code to buildinfo to detect this,
but I
hesitate to slow the program down further just to handle
a few problem
files. Instead, I added the duplicate-detection pipeline
to the end
of the program. That way, the user knows when this problem
appears.
About the Author
Larry Reznick has been programming professionally since
1978. He is currently
working on systems programming in UNIX, MS-DOS, and
OS/2.
He teaches C, C++, and UNIX language courses
at American River College and at the University of California,
Davis extension.
|