
USENET ELM: A Case Study in Portability between UNIX Systems

Sydney S. Weinstein

The diversity of UNIX systems requires "Universal UNIX Applications" to be as portable as possible. The attempt to keep one such application -- USENET Elm -- portable as both UNIX and C have evolved has required constant effort and provides a useful case study of UNIX portability issues.

Dave Taylor wrote Elm in the mid-1980s while he was working at Hewlett-Packard, then in 1987 released it, with HP's blessings, to the USENET community. Like much freely distributable UNIX software, Elm is released as source code compiled by the user or system administrator. Thus portability of the system at the source code level is mandatory.

Elm, in the UNIX vernacular, is a Mail User Agent (MUA). It displays the contents of a mailbox or folder (sequential text file containing mail messages), allows display of individual mail messages from the mailbox, accepts replies to those messages, and allows for generation of new messages for the Mail Transport Agent (MTA) to deliver. Elm does not deliver the messages; instead, it passes them to the MTA, which handles the routing and delivery.

Early UNIX MUAs were line-oriented, as the standard terminal in use was a hard-copy printing terminal. With the switch to CRT-based terminals, UNIX applications moved from a line orientation to a screen orientation. As one of the early screen-oriented MUAs, Elm incorporated the best features of the line-oriented MUAs available in the mid-1980s and extended the concept to a full-screen, menu-driven system. Designed to be simple to use and "intuitive," yet not so restrictive as to frustrate sophisticated users, Elm is currently used by approximately 250,000 individual users, on over 20,000 systems.

Original Elm Environment

Elm was initially developed with HP-UX, a port of the AT&T System V.2 version of UNIX. These systems used a K&R-style C compiler (ANSI C was not yet a glint in someone's eye). Elm was coded in the "loose" style common to software not intended to be ported between very diverse systems.

AT&T System V.2 Dependencies

Hewlett-Packard based HP-UX on the Motorola MC680x0 family of processors. Processors in this family share certain characteristics:

32-bit word length
32-bit integer length (int type)
32-bit argument passing (all arguments less than 32 bits long are converted to 32-bit values when placed on the stack as arguments to functions)
32-bit pointer length
Large linear addressing space with no segmentation

The common length for the pointer data type, argument passing, and the int data type allowed for some very loose programming practices, the most common being to intermix the int and pointer data types freely, on the assumption that an int can always hold a pointer value. The common length also means that an integer/character argument passed to a function could always be considered as an int. Casting arguments to convert the types explicitly was not necessary.

The large linear addressing space allowed large buffers to be placed on the stack and used to hold data values without concern for overflow. If overflow appeared likely, the size of the buffer could be increased -- there was plenty of room.

Because AT&T UNIX System V.2 limited filenames to fourteen characters, the individual elements of a full path name (the filenames) were short and the space reserved to hold path names was also very small. In addition, Elm used the C library provided with this UNIX since, at the time, no other version of UNIX had a different C library.

HP Function Keys

The original Elm was developed on and hard-coded to support Hewlett-Packard terminals. These terminals used their own keyboard layout with their own set of function keys. They also allowed for labeling the function keys on the screen directly above the keys themselves. Since the HP method is not an industry standard, the decision to hard-code support for HP terminals rather than use the termcap function key fields created even greater portability problems.

Dave's Own Curses

A common library package called curses generally performs screen updating in UNIX programs. Dave Taylor, Elm's creator, implemented his own, simpler, version of the curses package. He handled only the low-level terminal control routines, such as cursor move, up-line, down-line, and clear screen and left all the actual screen intelligence to his display routines. Its limited interaction with the curses package makes Elm very portable to other systems. At the same time, however, the code's low-level nature makes it very difficult to modify the screen code or add features. Instead of hiding the screen intelligence in the curses routines, Dave distributed it throughout many modules.

Dave's curses package did make use of UNIX's underlying terminal capability database. He used the calls from the older termcap system instead of the newer System V.2 terminfo system. The termcap/terminfo database tells applications programs how to perform a common set of functions on many different types of terminals. It allows UNIX tasks to be portable between terminal types.

In general, if you are writing a "universal UNIX application," you can best achieve portability by using the system configuration libraries, such as termcap/terminfo. Use of these facilities makes your program immediately portable to all systems and equipment to which anyone has ported those facilities. In the case of termcap/terminfo, your screen-oriented program can immediately function on whatever types of CRT terminal are in use.

Porting to BSD-Type Systems

Elm's first major port was from the HP-UX version of AT&T UNIX System V.2 to the other major variant of UNIX, the Berkeley Software Distribution (BSD). This was (and still is) a logical first major port, especially since one of the major UNIX minicomputers in the mid-1980s was the DEC VAX. The University of California at Berkeley had ported an earlier version of UNIX to the VAX and added support for demand-paged virtual memory and extended networking. This version became known as BSD UNIX.

The DEC VAX is very similar to the MC680x0 family. Both share the 32-bit features and large linear addressing space listed earlier, but the DEC VAX orders its bytes in the reverse order of the MC680x0. Since each processor is internally consistent, this difference becomes significant only if a memory region is addressed as two different data types. In the case, say, of a memory area addressed both as a text string and as an integer value, the integer value 0x41424344 (1,094,861,636) would be ABCD on the MC680x0 family and DCBA on the VAX family.

For purposes of portability, it is necessary to make sure no data structure refers to the same area of memory with two different fundamental types. All strings must be passed as string pointers. The short cut of placing a couple of characters into an int and passing the int will no longer work: the characters would come out backwards on the VAX family. In addition, code must examine union data structures to see which fundamental type is being used. Further, if the union is used to overlay two fundamental data types, the code must take into account the byte ordering of the system on which it is running.
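The difference can be seen in a minimal sketch that looks at the integer 0x41424344 in two ways. The function names are hypothetical, and the fixed-width type from <stdint.h> is a modern convenience that Elm's K&R-era code did not have.

```c
#include <stdint.h>
#include <string.h>

/* Non-portable view: copy the integer's bytes exactly as they sit in
 * memory. The result depends on the byte order of the host. */
void raw_bytes(uint32_t value, char out[5])
{
    memcpy(out, &value, 4);
    out[4] = '\0';
}

/* Portable view: pick the bytes out by arithmetic, most significant
 * first. Shifts act on the value, not on its layout in memory. */
void ordered_bytes(uint32_t value, char out[5])
{
    out[0] = (char)((value >> 24) & 0xFF);
    out[1] = (char)((value >> 16) & 0xFF);
    out[2] = (char)((value >> 8) & 0xFF);
    out[3] = (char)(value & 0xFF);
    out[4] = '\0';
}
```

On an ASCII system, ordered_bytes gives ABCD on every host, while raw_bytes gives ABCD on a big-endian host such as the MC680x0 and DCBA on a little-endian host such as the VAX.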

Failure to implement these subtle coding changes will not cause compiler errors or link problems; instead, the result will be strange behavior at execution time. The program could crash with an invalid pointer, for example, or it could get a cursor movement string out of sequence and scramble the display. These types of problems are very difficult to track down.

BSD 4.2/4.3 vs. AT&T System V.2

For the application programmer, the major differences between the BSD UNIX family and the AT&T UNIX family reside in the #include files and C runtime libraries. Each team developed its own runtime library, with the result that similar routines have different names. Also, identical data structures ended up in different #include files. The differences show up most notably in the string and memory manipulation functions (see Table 1). In particular, the memory block arguments are in the opposite order: memcpy takes the destination first and the source second, while bcopy takes the source first and the destination second.

Not only are the string routines defined differently, but the header files that declare them have subtly different names. The AT&T UNIX name is <string.h>, while under BSD UNIX, it's <strings.h>.

As a further complication, some routines exist in only one of the systems. Note that memset is generic, and the 0 used to initialize the block of memory is passed as an argument. bzero, on the other hand, can only set a block of memory to zero. Several other of the string functions included with System V.2 do not exist on early, or "pure," BSD systems. These include most of the library routines that start with the prefix str, as documented on the string(3) manual page. These routines, at least, will show up as missing header files at compile time or undefined externals at link time, making these types of problems much easier to track down.
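One way to cope with the naming and argument-order differences is a thin compatibility layer. The sketch below assumes a system that provides only the mem* routines and supplies BSD-style equivalents on top of them; the compat_ names are hypothetical and are not taken from Elm.

```c
#include <string.h>

/* BSD-style copy: source first, destination second.
 * memmove is used because bcopy is expected to cope with
 * overlapping blocks. */
void compat_bcopy(const void *src, void *dst, size_t len)
{
    memmove(dst, src, len);
}

/* BSD-style clear: memset can fill with any byte value,
 * bzero can only fill with zero. */
void compat_bzero(void *block, size_t len)
{
    memset(block, 0, len);
}

/* BSD-style compare: zero when the blocks match, nonzero otherwise. */
int compat_bcmp(const void *a, const void *b, size_t len)
{
    return memcmp(a, b, len) != 0;
}
```

Keeping the argument swap inside one wrapper means the rest of the program never has to remember which order a given system expects.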

Only rarely are functions with the same name in both versions used for different purposes. However, many similar commands take different arguments in the two versions, affecting shell scripts and spawned commands.

Long vs. Short Filenames

One of the more annoying differences between the older AT&T UNIX versions and the BSD versions is the AT&T 14-character filename limit. This difference normally creates problems when porting from BSD to AT&T (if filenames are longer than 14 characters), but can also cause difficulties when porting in the opposite direction. Usually, in this case, the problem deals with buffer lengths. Most programs written for systems without the flex-file names (the name for the longer file names used in BSD systems) leave relatively short buffers for path names. With the longer filenames these buffers often overflow, causing name truncation or, worse, other data items on the stack to be overwritten.
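A defensive way to build path names is to check the required length before writing into the buffer. The following is only a sketch; join_path is a hypothetical helper, not a routine from Elm.

```c
#include <stdio.h>
#include <string.h>

/* Build "dir/name" in out, which holds out_size bytes.
 * Returns 0 on success, or -1 if the result would not fit. */
int join_path(char *out, size_t out_size, const char *dir, const char *name)
{
    /* One extra byte for the '/' and one for the terminating NUL. */
    size_t needed = strlen(dir) + 1 + strlen(name) + 1;

    if (needed > out_size)
        return -1;          /* refuse rather than truncate or overflow */
    sprintf(out, "%s/%s", dir, name);
    return 0;
}
```

Reporting the failure to the caller avoids both symptoms described above: a silently truncated name and a buffer overrun on the stack.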

Since the filenames are of different lengths, it follows that the directory structures must also differ. For this reason, the directory access functions differ in the data types of their arguments. This difference can also result in programs that compile correctly but do not produce the expected results. Symptoms include directory listings within the program that appear to be missing files or that show garbage filenames or the inability of the program to find files in the directory.
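The usual remedy is to hide the directory entry type behind a single name chosen at compile time. The sketch below assumes a hypothetical symbol USE_SYS_DIR for older BSD-style systems that declare struct direct in <sys/dir.h>; without it, the <dirent.h> interface is used.

```c
#include <stddef.h>
#include <sys/types.h>

#ifdef USE_SYS_DIR              /* hypothetical symbol: older BSD style */
#	include <sys/dir.h>
typedef struct direct dir_entry;
#else                           /* dirent-style interface */
#	include <dirent.h>
typedef struct dirent dir_entry;
#endif

/* Count the entries in a directory.
 * Returns the count, or -1 if the directory cannot be opened. */
int count_entries(const char *path)
{
    DIR *dp = opendir(path);
    dir_entry *entry;
    int count = 0;

    if (dp == NULL)
        return -1;
    while ((entry = readdir(dp)) != NULL)
        count++;
    closedir(dp);
    return count;
}
```

Because the rest of the code refers only to dir_entry, a mismatch between the header and the structure name shows up as a compile error rather than as garbage filenames at run time.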

Mailbox Locking

Another component of UNIX that was not yet standard when the AT&T and BSD split occurred was file locking, and both versions developed their own method of handling interlocks to prevent two processes from writing to the same file. The original mail systems created a semaphore file in the mail spool directory to indicate their locking of the spool file. This scheme worked well for local systems, but required that the mail user agent and the mail transport agent have permission to create files in the spool directory. The steps in locking of this type are:

1. Attempt to create a file of the name LCK..name in the spool directory. If the create succeeds, you have locked the file.

2. If the create fails, then someone else has locked the file already. If the iteration limit has not been exceeded, sleep for a short duration, then return to step one to try again.

3. If the iteration limit has been exceeded, report the error to the user and, optionally, just ignore the lockfile.

Later revisions of this method placed the process id (PID) of the owning process in the lock file. When the create failed, the file could be opened for read and a system call would determine if the lock was stale (the process that owned it no longer existed). If the lock was stale, it would be removed and the locking process would be repeated.
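The lock-file protocol, including the stale-lock check, can be sketched roughly as follows. This is an illustration using POSIX calls, not Elm's actual code; the function names are hypothetical, the caller supplies the full lock-file path, and exclusive creation is assumed to be atomic, which holds on a local file system but was not reliable over early NFS.

```c
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Try once to create the lock file and record our PID in it.
 * O_EXCL makes the create fail if the file already exists. */
static int try_lock(const char *lockpath)
{
    char pidbuf[32];
    int len;
    int fd = open(lockpath, O_WRONLY | O_CREAT | O_EXCL, 0444);

    if (fd < 0)
        return -1;
    len = sprintf(pidbuf, "%ld\n", (long)getpid());
    if (write(fd, pidbuf, (size_t)len) != len) {
        close(fd);
        unlink(lockpath);
        return -1;
    }
    close(fd);
    return 0;
}

/* Return 1 if the lock file names a process that no longer exists. */
static int lock_is_stale(const char *lockpath)
{
    long pid = 0;
    FILE *fp = fopen(lockpath, "r");

    if (fp == NULL)
        return 0;
    if (fscanf(fp, "%ld", &pid) != 1)
        pid = 0;
    fclose(fp);
    if (pid <= 0)
        return 0;               /* unreadable PID: assume the lock is live */
    /* Signal 0 delivers nothing; it only tests whether the process exists. */
    return kill((pid_t)pid, 0) == -1 && errno == ESRCH;
}

/* Make up to max_tries attempts. Returns 0 once locked, -1 on giving up.
 * Removing a stale lock uses up one attempt. */
int acquire_lock(const char *lockpath, int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (try_lock(lockpath) == 0)
            return 0;
        if (lock_is_stale(lockpath)) {
            unlink(lockpath);   /* note: two processes can race here */
            continue;
        }
        if (i + 1 < max_tries)
            sleep(1);
    }
    return -1;
}

void release_lock(const char *lockpath)
{
    unlink(lockpath);
}
```

A caller would wrap each mailbox update in acquire_lock and release_lock, and report an error to the user when acquire_lock returns -1.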

AT&T System V.2 used this revised method for mailbox locking. BSD systems started with this locking protocol, but due to atomic file creation problems with NFS (Network File System), switched to locking the file only using the kernel file locking system call. Newer UNIX System V.4 systems use a system call for locking the file that is different from that used by the older BSD systems.

Using the wrong locking technique for the system results in a window of time where two tasks can write to the mailbox. This can cause garbled messages, lost messages, or truncated mailboxes. If your program opens a file for writing, you must consider how file locking is performed on all systems to which your application will be ported.
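Kernel-level locking is typically selected at compile time. The sketch below assumes a hypothetical symbol USE_FLOCK for systems that lock with the BSD flock call, and falls back to fcntl record locking otherwise; it is not the scheme Elm itself uses.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#ifdef USE_FLOCK                /* hypothetical symbol: BSD-style locking */
#	include <sys/file.h>
#endif

#ifndef USE_FLOCK
/* Apply an fcntl lock of the given type to the whole file. */
static int fcntl_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* zero length means "to end of file" */
    return fcntl(fd, F_SETLK, &fl) == -1 ? -1 : 0;
}
#endif

/* Lock the whole file for writing without blocking.
 * Returns 0 on success, -1 if the lock could not be obtained. */
int lock_whole_file(int fd)
{
#ifdef USE_FLOCK
    return flock(fd, LOCK_EX | LOCK_NB);
#else
    return fcntl_lock(fd, F_WRLCK);
#endif
}

/* Release the lock taken by lock_whole_file. */
int unlock_whole_file(int fd)
{
#ifdef USE_FLOCK
    return flock(fd, LOCK_UN);
#else
    return fcntl_lock(fd, F_UNLCK);
#endif
}
```

The two mechanisms do not necessarily exclude each other, so a mail user agent is only safe if it uses the same method as the mail transport agent on that system.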

Changes in the Port

No method exists for writing a single set of code that can handle both the System V.2 and the BSD versions of UNIX. However, the #ifdef command of the C preprocessor makes it possible to integrate both versions into the same source files. Elm used this method to provide a single version of source code for both systems. The initial #ifdef symbol was BSD and was passed to the C compiler via the Makefile. ifdefs then handled the code required for the different serial communications system calls (setting up the serial line communications modes), different string routines, and different header files. In addition, this port revealed some of the weaknesses regarding buffer sizes mentioned earlier. During the port, all the buffer sizes were adjusted to fit the needs of the larger of the two systems.

Elm did not run into any problems with byte ordering at this stage of the port. However, byte order did become a problem once it became possible to share the Elm alias database between NFS-linked systems.

A surprise arose in the different implementations of the <ctype.h> macros for character manipulation. The standard System V.2 macros toupper and tolower, which convert a character's case, would change only lower- or upper-case characters, respectively. If the character passed to the macro was not the appropriate case, no change was made. For example, in the statement

c = tolower('a');

under System V.2, c would contain the lower case letter `a'. Under BSD, the macro is implemented as

#define tolower(c) ((c) - 'A' + 'a')

This macro turns the lower case a (0x61) into a SOH code with the eighth bit set (0x81). The two macros had to be redefined as follows to make the code compatible for both System V.2 and BSD:

#define tolower(c) (isupper(c) ? ((c) - 'A' + 'a') : (c))

The isupper macro now protects the code, preventing translation of all but upper-case letters. However, this redefinition is still not fully portable. It assumes that lower- and upper-case letters are always the same distance apart in the character set as the upper and lower case 'a'. This is true for ASCII, but not for all character sets.
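The remaining ASCII assumption can be avoided by letting the library perform the mapping and using the guard only to decide whether to call it. The sketch below is one hedged way to do this; safe_tolower and NAIVE_TOLOWER are hypothetical names, and the 0x81 result holds only on an ASCII system.

```c
#include <ctype.h>

/* The unguarded arithmetic form discussed above. On an ASCII system it
 * turns 'a' (0x61) into 0x81. */
#define NAIVE_TOLOWER(c) ((c) - 'A' + 'a')

/* Guarded conversion: only upper-case letters are translated, and the
 * library's own mapping is used instead of assuming that the two cases
 * sit a fixed distance apart in the character set. */
int safe_tolower(int c)
{
    unsigned char uc = (unsigned char)c;

    return isupper(uc) ? tolower(uc) : c;
}
```

Writing the guard as a function rather than a macro also avoids evaluating the argument more than once, which matters for calls such as safe_tolower(*p++).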

Heterogeneous System File Sharing

The next big portability hurdle for Elm came when systems were linked together via NFS into one common disk cluster. NFS allowed many different types of systems -- even non-UNIX systems -- to share disk partitions, and many sites mounted the users' home directories via NFS. Elm, which uses a file for global aliases, then needed to access the private alias data across the NFS file system as well. Since the system where the file resided and the system running Elm were not necessarily of the same type, byte order immediately became an issue.

Big vs. Little Endian

The battle over the order by which to number a word's bits and bytes has often been compared to the wars waged by the Lilliputians of Gulliver's Travels over such issues as which end of the egg should be eaten first, the little or the big end. Networking forced UNIX to rise above this war and declare a truce, or at least a translator.

Since all networks need multibyte addresses to identify all of the hosts and circuits, these addresses must share a common byte order. Communication becomes impossible if a single machine is known as node 0x1234 on one system and node 0x3412 on others. The solution is to pass bytes over the network in network byte order. For TCP/IP networks, specifications issued by the Network Information Center document this order. Several macros (see Table 2) assist the C programmer in placing the bytes in that order (each routine converts one item into the proper byte ordering). Elm was adapted to store its alias tables using these routines, with the result that the table appears the same whether the machine accessing it is "little-endian" or "big-endian." Users whose home directory is cross-mounted via NFS can access their private alias table regardless of which type of system they are on. In addition, the global or master alias table can also be shared across systems.
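As an illustration of the technique, the sketch below stores and retrieves one 32-bit value in network byte order using the standard htonl and ntohl routines. The function names are hypothetical, and the layout shown is not Elm's actual alias file format.

```c
#include <arpa/inet.h>          /* htonl, ntohl */
#include <stdint.h>
#include <string.h>

/* Write value into buf[0..3] in network byte order. */
void store_u32(unsigned char *buf, uint32_t value)
{
    uint32_t net = htonl(value);

    memcpy(buf, &net, sizeof net);
}

/* Read a value written by store_u32, converting back to host order. */
uint32_t load_u32(const unsigned char *buf)
{
    uint32_t net;

    memcpy(&net, buf, sizeof net);
    return ntohl(net);
}
```

Whatever the host, store_u32 places the most significant byte first, so a file written on one type of machine reads back correctly on the other.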

NFS Locking

NFS added a degree of portability to Elm, but it also brought problems. File locking, already discussed in the section on mailbox locking, was late to be standardized under UNIX. The multiple locking methods require portable C programs to adapt their locking methods to each system's standard. NFS makes that situation a bit worse. Since NFS is stateless, cross-system locking cannot be defined using the standard method (lockf or flock) for NFS-mounted file systems. To work around the problem where remote programs access files via NFS, some systems use a special daemon, rpc.lockd, to perform the locks locally on the system where the files actually reside. This requires the portable C program to have yet another method of locking files. At present (2.3 and 2.4), Elm does not use the lock daemon.

Coping with System Differences

As the prior sections demonstrate, many of the modifications required for portability between UNIX versions, or for that matter, between UNIX and other operating systems, require changes to the code for each system type. Yet, to maintain several versions of the same file, one for each different standard, would be impractical and would lead to problems such as inconsistent code, wasted space, and a complicated makefile procedure.

Fortunately, C provides a construct to handle these differences with a single source file.

The C preprocessor has three commands -- #if, #ifdef, and #ifndef -- that do much of the work in creating portable programs.

#if tells the preprocessor to emit the lines following the command until it reaches an #else or #endif only if the expression on the command line is true. Each symbol in the expression is evaluated based on its value at that point in the file. These are symbols, not variables, so each must be set to a value using a #define statement or the -Dsymbol=value argument to the command line.

#ifdef tells the preprocessor to emit the lines following the command until it reaches an #else or #endif if the symbol on the command line has been defined. It does not matter what value the symbol has. The symbol can be defined by a #define statement, by the -Dsymbol argument to the compiler command line with or without a value, or could have been predefined within the preprocessor itself. System manufacturers generally predefine a symbol within their C preprocessor to identify the system. This symbol is intended to delimit code that must differ for their system.

#ifndef tells the preprocessor to emit the lines following the command until it reaches an #else or #endif if the symbol has not been defined. The symbol can either never have been defined or have been cleared by an #undef command.

In all three cases, the C preprocessor will emit the lines following the command if the condition is met, causing the compiler to compile the lines on later passes. If the condition is not met, the C preprocessor just outputs a blank line for each line being skipped. When the #else command is reached, if there is one, the action is reversed. In any case, the if condition ends at the #endif command, which is required.
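A small sketch may make the three commands concrete. All of the symbol names and values here are hypothetical and chosen only for illustration.

```c
#define SPOOL_DEPTH 2           /* a symbol with a value */
#define HAS_LONG_NAMES          /* a symbol defined with no value */
/* USE_TERMINFO is deliberately left undefined. */

#if SPOOL_DEPTH > 1             /* #if tests the value of an expression */
#define PATH_BUF_SIZE 1024
#else
#define PATH_BUF_SIZE 256
#endif

#ifdef HAS_LONG_NAMES           /* #ifdef tests only that the symbol exists */
#define NAME_BUF_SIZE 256
#else
#define NAME_BUF_SIZE 15        /* 14 characters plus the terminating NUL */
#endif

#ifndef USE_TERMINFO            /* #ifndef is true when the symbol is absent */
#define TERMINAL_DATABASE "termcap"
#else
#define TERMINAL_DATABASE "terminfo"
#endif

/* Accessors so the selected values can be inspected at run time. */
int path_buf_size(void) { return PATH_BUF_SIZE; }
int name_buf_size(void) { return NAME_BUF_SIZE; }
const char *terminal_database(void) { return TERMINAL_DATABASE; }
```

With the definitions as shown, the first branch of each condition is compiled; adding -DUSE_TERMINFO to the compiler command line would switch the last one without any change to the source.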

The conditions can be nested in such a way that a check for one symbol is conditional on the preceding check for another. However, portability requires that you nest statements in a way that all C compilers will understand. For ease of readability, it is often useful to indent nested ifdefs as

#ifdef CONDITION1
    #ifdef CONDITION2
    #endif CONDITION2
#else CONDITION1
    #ifndef CONDITION3
    #endif !CONDITION3
#endif CONDITION1

Two aspects of this construct can create problems for some compilers. First, many C preprocessors require that the # character be in the first column of the line. And, second, many do not allow symbols on the #else and #endif lines. To ensure portability, type the lines as follows

#ifdef CONDITION1
#	ifdef CONDITION2
#	endif /* CONDITION2 */
#else /* CONDITION1 */
#	ifndef CONDITION3
#	endif /* !CONDITION3 */
#endif /* CONDITION1 */

Since ifdefs are often nested to many levels and the #else or #endif might not be close to the command which it affects, placing the condition name as a comment on the #else and #endif lines helps to clarify the structure.

Elm has always based its system portability changes on ifdefs, and as the number grew, the comments were added to make the range of each ifdef more apparent. However, this proliferation of ifdefs leads to the next problem: what is the proper condition to use?

How to Use #ifdef

When Elm was first ported, all of the changes required for the BSD version were grouped under the symbol BSD. This led to code fragments like

#ifdef BSD
#	define strchr index
#	define strrchr rindex
#	include <sys/pwd.h>
#	undef tolower
#	undef toupper
#else
#	include <pwd.h>
#endif

Such constructs allow compiling the BSD version with just the symbol -DBSD added to the CFLAGS= line of the makefile. Problems arose, however, as Elm was ported to systems that were hybrids of the pure System V.2/V.3 and BSD 4.2/4.3 versions. No longer were all of these changes required all of the time.

A better approach is to define a symbol for each portability change itself, rather than for the system as a whole, and to name each symbol as closely as possible after the condition it tests. If the previous code fragment had been written as

#ifdef HAS_INDEX
#	define strchr index
#	define strrchr rindex
#endif
#ifdef PWDINSYS
#	include <sys/pwd.h>
#else
#	include <pwd.h>
#endif
#ifdef TOLOWER_MACRO
#	undef tolower
#	undef toupper
#endif

then, as the different operating system versions required different combinations of changes, the CFLAGS= line could be changed as needed. If the CFLAGS= line in the makefile becomes too complicated, then in one global header file, included first in all modules, a code sequence similar to

#ifdef ATT_SVR2
#	undef HAS_INDEX
#	undef PWDINSYS
#	undef TOLOWER_MACRO
#endif

#ifdef SUNOS_41
#	define HAS_INDEX
#	define PWDINSYS
#	define TOLOWER_MACRO
#endif

#ifdef HPUX_8
#	undef HAS_INDEX
#	undef PWDINSYS
#	undef TOLOWER_MACRO
#endif

could handle each of the combinations with only a single flag on the CFLAGS= line of the makefile.

Using this type of code sequence in the include file, porting to a new operating system would only require listing the features the system supports. Of course, any new quirks of that operating system would generate new names and changes to the code in the rest of the program. But still, the makefile would require only the name of the version on its CFLAGS= line.
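The rest of the program can then test the feature symbol and never mention a system name. The sketch below wraps the choice in a function instead of redefining the library name with #define; find_char is a hypothetical helper, and HAS_INDEX is assumed to mean that only the BSD index routine is available.

```c
#include <string.h>
#ifdef HAS_INDEX
#	include <strings.h>
#endif

/* Return a pointer to the first occurrence of ch in s, or NULL if
 * there is none. Which library routine does the work is decided by
 * the feature symbol alone. */
char *find_char(const char *s, int ch)
{
#ifdef HAS_INDEX
    return index(s, ch);
#else
    return strchr(s, ch);
#endif
}
```

If a later system turns out to need a different combination of features, only the block of symbol definitions for that system changes; callers of find_char are unaffected.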

A side effect of this change is that there are now many, if not hundreds, of symbols created to ensure the widest portability, and it becomes very difficult to determine the proper values for a new operating system version/port for each of these symbols. But with proper coding style, help is on the way later in this article in the section on Metaconfig.

The Merge of System V and BSD

The merger of the System V and BSD standards into the new System V Release 4 standard has greatly complicated the choice of ifdef symbols. Besides changing the location of many #include files, this standard splits into separate conditions many of the old combinations of things that used to go together under a single ifdef. In particular, SVR4 supports many items in both styles, and which style works better varies from item to item.

Elm used to group most of the BSD compatibility changes together. Now that SVR4 has most of those items within the System V defines, these ifdefs had to restrict their range once again, making it all the more important to choose the ifdef symbol to cover as little as possible -- preferably just the single change required for the port. Then, when the underlying operating system changes, at worst the symbols will simply need to be defined/undefined to adapt.

Metaconfig and Configure

Larry Wall has written many programs for C programmers and has shared them with the USENET community. All of the programs run on many different types of UNIX operating systems. To simplify porting, Larry wrote a shell script called Configure for his rn program (a USENET network news reader) that tried to determine automatically the values needed for the various ifdef symbols. Where the script could not determine the answer automatically, it would ask for "local preference" items. To automatically configure the software, you just typed Configure at the shell prompt.

The Configure script would identify the location of needed commands and libraries, check the contents of those libraries to determine which functions were available, and ask the user for local preference items. From these, an #include file was built and included into each source file. The header file contained the results of the program and function checks as #define SYMBOL or #undef SYMBOL lines. It also included the preference items as #define PREFERENCE_SYMBOL value lines.

Coding the program to take advantage of Configure's symbols allowed immediate configuration at the source level. However, writing the Configure script by hand for each new program was tedious. Since most of it was boilerplate, and whole sections could be used by many different programs, the script was a perfect candidate for automatic generation by a tool. Since Larry was working on a very large program with many portability changes, he used the program as both the reason to develop the tool and as a method of developing it. The program was Perl, and the tool he developed is Metaconfig.

Metaconfig is a large Perl script that scans a list of files, called a manifest, looking for all symbols used on #if type lines in the .c, .h, and .y files, and all shell variables used in the .SH files. These symbols form the wanted list. Using these symbols, Metaconfig then searches a library of shell script fragments, called units, for those units that define the symbols on the wanted list. Each of the units also lists the other units it requires, if any. All of these units are then combined in an order to satisfy the dependencies, and placed with a common start and end code to form the shell script Configure.

Since the units are common and reusable, a library of units was quickly developed that Metaconfig can use for other programs. Each unit is placed in a file named by combining the primary symbol name with a .U suffix. These units form the master library used by Metaconfig.

Each program also has a local library of units which are similar to the master units, but incorporate changes to the master library equivalent unit. The local override units are given the same name as the master library unit they replace. When Metaconfig is run, it generates a message specifying which local units will override the equivalent units from the master library.

In addition to the override units, the local library includes units that are specific to a program and not considered useful to other programs. These custom unit files are also named by combining the primary symbol name and a .U suffix.

Metaconfig units and the symbols they define fall into three categories:

Symbols that are automatically determined by the Configure script and cannot be overridden by the user.

Symbols that are automatically determined by the Configure script, but can also be overridden by the user. The automatically determined value becomes the default value the first time the script is run. The answer given the last time the script was run is the default value for each subsequent time the Configure script is executed.

Symbols that are local preference items. No automatic value is possible. Sometimes the unit's code specifies a suggested value for a default value the first time Configure is run. Configure uses the answer from the prior run as the default for each subsequent run.

An example of the first case would be to check for certain functions in the C library. Configure automatically determines what C functions exist in the libraries chosen to link the application. This list is available via a shell function and is used to define symbols based on the availability of individual functions.

Listing 1 shows d_strcspn.U, a unit from Elm's local Metaconfig library, which checks the existence of certain C functions. The lines preceded by a ? are control lines for Metaconfig.

RCS-type lines are comment lines for use by the Revision Control System and contain version tracking information.

MAKE-type lines contain a list of shell symbols defined in this unit, followed by a colon (:), and then the list of symbols/functions this unit requires to be already defined. This second list is the dependency list. The d_strcspn.U unit defines two shell symbols, d_strspn and d_strcspn, and requires that the shell symbols and libc already be defined. The first symbol before the colon is the primary symbol. The unit's filename must match this symbol with a .U suffix.

The second MAKE line defines the types of operations the dependency makefile requires for this unit (the definition of these types is too long to be included here, but is explained in the Metaconfig documentation).

S-type lines are extracted to form documentation on the shell symbols available in the different unit files. The Metaconfig source includes a program that automatically extracts these lines from all of the units to produce a document on the available symbols.

C-type lines function similarly to S-type lines, but for symbols defined for use in C code rather than in shell scripts. Once again, the Metaconfig source includes a program that automatically extracts these lines and forms a document on all of the available C preprocessor symbols.

H-type lines are used by Metaconfig to automatically generate the configuration include file.

The remainder of the lines comprise the shell script fragment. In the simple example in Listing 1, the shell script uses a fragment of shell code that is contained in the shell variable inlibc. The libc unit defines this variable, thus the libc dependency on the first MAKE-type line. The inlibc function searches the name list from the C libraries to see whether the symbol in the shell variable $1 exists. If it does, the symbol in $2 is set to define. If not, the symbol in $2 is set to undef. The set command on the line preceding the inlibc call initializes $1 and $2. Using the value just set into the symbol d_strspn, the ?H-type lines will automatically produce a #define or #undef for the symbol STRSPN. The C code can then use the line #ifdef STRSPN when it needs to call the strspn C library function, and provide alternate code following a #else line.
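On the C side, the pattern might look like the following sketch. It assumes that the header written by Configure would define or undefine STRSPN; here the symbol is simply left to the compiler command line. The name span_of and the fallback implementation are hypothetical and are not taken from Elm.

```c
#include <stddef.h>
#include <string.h>

#ifdef STRSPN
/* The library has strspn, so use it directly. */
#define span_of(s, accept) strspn((s), (accept))
#else
/* Fallback: length of the leading run of s that consists only of
 * characters found in accept. */
size_t span_of(const char *s, const char *accept)
{
    size_t n;
    const char *a;

    for (n = 0; s[n] != '\0'; n++) {
        for (a = accept; *a != '\0' && *a != s[n]; a++)
            ;
        if (*a == '\0')
            break;              /* s[n] is not in accept */
    }
    return n;
}
#endif
```

Either way, callers write span_of(line, "0123456789") and get the same result whether or not the system's C library supplies strspn.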

d_internet.U (Listing 2) provides an example of the second type of Metaconfig unit, one that allows the user to input a value to override the default. The header lines are the same, but the shell script fragment is a bit more complicated. The first section uses the case construct to set the default value for the d_internet symbol based on the value in the shell variable d_internet from the prior run. If the d_internet variable is empty, or not one of the strings define or undef, the default value is set based on some conditions the shell script can check on its own. In this case, those symbols are set by other units or by shell code directly in this unit. The middle section echoes a message that explains the meaning of the symbol the user is about to define. The script then asks the question, presenting the default answer to the user. Lastly, the result the user types is checked to see how to define the shell symbol d_internet.

The last type of Metaconfig unit is used to define a user choice or local preference. The unit for these looks almost identical to the unit shown in Listing 2. The only difference is in how the default value is set when there is no prior answer to use. While d_internet.U used a value determined by the Configure script as the default, this local preference unit uses a hard-coded default directly in the shell fragment. Of course, it is still preferable to remember the answer from the last Configure run and use that as the default whenever possible.

Just as C files can directly include the .h file written by the Configure script, shell scripts and other non-C files can use the shell variables in the file config.sh created by the Configure script to adapt to the results of the Configure run. The Configure script executes all files ending in .SH in the manifest to produce the appropriate adapted file. Listing 3 shows an extract from the makefile prototype, Makefile.SH, in Elm's master directory.

The .SH files are broken into three sections. The first section, which runs up to the echo statement, locates the config.sh file, which contains all the answers obtained by the Configure script. Configure then reads this file into the current shell. The second section uses the shell variables to modify the lines with the results of the config.sh just read. The last section just adds the remainder of the file that does not need the variables substituted. In the listing, the line [...] indicates that lines were deleted from this example. The actual makefile is much larger.

By coding your program to take advantage of the existing library of units, you can achieve instant portability between most UNIX operating systems with Metaconfig. In addition, by allowing for local preferences, Metaconfig provides an easy means of customizing the distribution.

International Portability

The upcoming 2.4 version of Elm tackles a totally new problem -- international portability. The ASCII character set, which most UNIX systems use, takes advantage of the English language's 26-character alphabet to be a seven-bit code, with the eighth bit within the eight-bit byte used for parity. On most UNIX systems, internally, the eighth bit is always zero, cleared by the istrip terminal control parameter.

Eight-Bit Clean

For languages with alphabets of more than 26 characters, the eighth bit is used to extend the character set to support additional characters. Any program destined for international consumption, then, must be eight-bit clean, which means that you do not alter or clear the eighth bit of any character value, and you do not depend on all character values to be positive when viewed as signed characters. The international standard treats all characters as unsigned quantities.

Using the eighth bit to extend the character set also changes the definition of an alphabetic character. It is no longer valid to consider the ranges 'A'-'Z' and 'a'-'z' as the only alphabetical characters. All checks for the type of character should use the macros defined in <ctype.h>. It is the system's responsibility to have the proper values in this file and its associated modules in the C library to support the local character set. Because Elm has always been eight-bit clean and has always used the macros instead of direct comparisons, version 2.4 required no changes in these areas.
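A minimal sketch of an eight-bit-clean character test follows. The function count_alpha is a hypothetical illustration, not Elm code. It relies on the <ctype.h> classification rather than range comparisons, and it converts each character to unsigned char first so that values with the eighth bit set are never passed to the macro as negative numbers.

```c
#include <ctype.h>
#include <stddef.h>

/* Count the alphabetic characters in s without assuming that only
 * 'A'-'Z' and 'a'-'z' are letters. (Hypothetical helper.) */
size_t count_alpha(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* On systems where plain char is signed, a character with the
         * eighth bit set is negative; the cast keeps it in the range
         * the ctype macros accept. */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Which eight-bit values count as letters depends on the character set the C library has been told to use, so the same code gives locale-appropriate answers without change.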

It's worth noting that some character sets are too large even for eight bits (the Japanese Kanji character set, for example, uses 16-bit characters). For purposes of international portability, your program should not assume an eight-bit character type.

NLS and Message Catalogs

Changes to messages, prompts, and commands from English to the local language represent the most significant challenge in internationalization. Since most programmers do not speak all the languages needed to please all of the potential users of their programs, how do you solve this problem?

The solution uses the concept of Native Language System (NLS) support. The X/Open standards committee, a group of computer companies, produced an NLS usable for UNIX that provides several components:

LOCALE functions for setting the desired character set and language characteristics, including bit length, collation sequence, and character attributes.
System error messages in each of the locally supported language sets.
Message catalog support.

The LOCALE subsystem tells the C runtime library which character set is in use. The user typically defines the desired character set as an environment variable. The locale functions read the variable and set up the appropriate structures and collation lists. ctype.h macros use these character attributes to determine the class of each character. The collating sequence allows the extended characters to be sorted in appropriate order, rather than be grouped at the end due to the unused portion of the character-set code space.
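The following sketch suggests how a program might adopt the user's locale and then sort strings by the locale's collating sequence. It uses the standard C functions setlocale and strcoll. The names use_environment_locale and sort_names are hypothetical, and which environment variables are consulted is left to the system.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Adopt the locale named by the user's environment. Returns 1 on success,
 * 0 if the requested locale is unavailable (the default "C" locale then
 * stays in effect). */
int use_environment_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* qsort() comparison that follows the current locale's collating sequence
 * instead of raw character-code order. */
static int compare_collated(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort an array of strings in locale order. */
void sort_names(const char **names, size_t count)
{
    qsort(names, count, sizeof names[0], compare_collated);
}
```

A program would typically call use_environment_locale once at startup. Until it does, the C library stays in the "C" locale, where strcoll orders strings just as strcmp does.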

The user also sets the language for system error messages in an environment variable. The locale functions initialize the syserror structure with the messages in the appropriate language.

The most important change is support for message catalogs. Because most C programs, including Elm until the 2.4 release, code their messages directly into the source, a single compiled version cannot output different messages based on the language desired. Rather than requiring that messages for every supported language be coded directly into the program, the NLS solution lets the user define new message catalogs that contain the text of all of the messages, translated by the user into the chosen language. For example, to display the prompt for the command that scans a message for calendar entries, Elm would use the C code fragment

PutLine0(LINES-3, strlen(Prompt),
"Scan message for calendar entries...");

This fragment, in English only, places the message at the bottom of the screen. A message catalog function, however, obtains the message from a file based on its message number. The file can be translated into any language so that the program can automatically speak that language. Recoding the example using the message catalog functions yields

PutLine0(LINES-3,
strlen(Prompt),
catgets(elm_msg_cat, ElmSet, ElmScanForCalendar,
"Scan message for calendar entries..."));

The function catgets reads the message catalog and loads into memory all the messages from the set ElmSet, if they are not already in memory. It then returns the text string of the message ElmScanForCalendar. If the catalog elm_msg_cat is not open, or there is no set ElmSet or no message ElmScanForCalendar, the string passed as the last argument of the call is returned as the default answer.

The function that opens the message catalog, catopen(), uses the language environment variable to select the correct file from the application program's set of message catalogs, each of which contains the application's messages in a single language. The program that compiles the messages into the file also produces a C header file that defines the set and message number symbols.
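The sequence of opening a catalog, fetching a message, and falling back to the built-in English text might be sketched as follows, using the X/Open calls catopen, catgets, and catclose from <nl_types.h>. The function scan_prompt and the numbers DEMO_SET and DEMO_SCAN_FOR_CALENDAR are hypothetical. In a real program the set and message numbers would come from the generated header file.

```c
#include <nl_types.h>
#include <string.h>

/* Hypothetical set and message numbers for illustration only. */
#define DEMO_SET               1
#define DEMO_SCAN_FOR_CALENDAR 1

/* Copy the (possibly translated) prompt into buf. If the catalog cannot
 * be opened or lacks the message, the English default is used. */
void scan_prompt(const char *catalog_name, char *buf, size_t size)
{
    const char *fallback = "Scan message for calendar entries...";
    const char *text = fallback;
    nl_catd cat;

    if (size == 0)
        return;

    /* catopen() returns (nl_catd)-1 when no catalog can be found. */
    cat = catopen(catalog_name, 0);
    if (cat != (nl_catd)-1)
        text = catgets(cat, DEMO_SET, DEMO_SCAN_FOR_CALENDAR, fallback);

    /* Copy before closing: the returned text may point into storage
     * owned by the catalog. */
    strncpy(buf, text, size - 1);
    buf[size - 1] = '\0';

    if (cat != (nl_catd)-1)
        catclose(cat);
}
```

A long-running program such as a mail reader would more likely open the catalog once at startup and keep it open, rather than reopening it for every message as this sketch does.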

Because word order rules and conventions vary among languages, a straightforward string replacement mechanism would produce garbled messages. Where an English message reads "6 messages received," for example, the message in another language might read "received 6 messages." In C, the printf function converts the numbers into text strings and builds simpler strings into complete messages. If the string "messages", or its foreign translation, is in the variable msgs, and the string "received" is in the variable rcvd, then the message could be output with the printf statement

printf("%d %s %s\n", num_msgs, msgs, rcvd);

Since the arguments are passed in order on the stack, the printf function just uses them in order to fulfill its format string. To turn that message into "received 6 messages," printf must access the arguments on the stack in a different order. NLS provides for this ability with an extension to the printf function. If a conversion specification in the format string contains an integer followed by a $ character immediately after the %, that integer is interpreted as the ordinal of the argument to use for that conversion. The same string would then be printed as

printf("%1$d %2$s %3$s\n", num_msgs, msgs,
rcvd);

It then becomes easy to turn the message around to say "received 6 messages" using

printf("%3$s %1$d %2$s\n", num_msgs, msgs,
rcvd);

Once again, the different format strings for these last two printf statements would be obtained from the message catalog using the catgets() function. The final printf statement would read

printf(catgets(elm_msg_cat, ElmSet,
ElmMessagesReceived, "%d %s %s\n"),
num_msgs, msgs, rcvd);

In addition, the values for the variables msgs and rcvd can also be obtained from the message catalog.

The English version does not need the $ notation as the arguments are used in their natural order. The translations in the message catalog would use the $ notation as needed.
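To see the reordering in isolation, the sketch below formats the same three arguments with either the English default or a translated format string that uses the $ notation. The function format_received is hypothetical. The translated format is passed in directly where a real program would obtain it from catgets, and the code assumes a C library that supports the X/Open positional extension.

```c
#include <stdio.h>
#include <string.h>

/* Build the "messages received" text in buf. translated stands in for a
 * format string fetched from a message catalog; when it is NULL or empty
 * the English default is used. Returns what snprintf() returns. */
int format_received(char *buf, size_t size, const char *translated,
                    int num_msgs, const char *msgs, const char *rcvd)
{
    const char *format = "%d %s %s";    /* English: natural order */

    if (translated != NULL && strlen(translated) > 0)
        format = translated;            /* e.g. "%3$s %1$d %2$s" */

    /* The arguments are always passed in the same order; only the
     * format string decides the order in which they appear. */
    return snprintf(buf, size, format, num_msgs, msgs, rcvd);
}
```

Called with a NULL format, 6, "messages", and "received", the function produces the English order; called with "%3$s %1$d %2$s", the same arguments come out as "received 6 messages". Within one format string, either every conversion uses the $ notation or none does.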

The problem remains of writing for an operating system whose vendor doesn't support NLS. Several freely distributable programs provide NLS support, including new versions of the printf family of functions. Elm, with release 2.4, will include one such program so that users whose systems don't support NLS will still be able to compile new message catalogs for the language of their choice.

Future Portability Issues

Up to this time, Elm has supported only electronic mail interchange using UNIX-based messaging systems. These systems use the RFC-822 standard to format messages. A newer, international standard, entitled X.400, has been approved by the CCITT (the international standards body). This standard allows for a hierarchical address to any place in the world, on any computer system. And, unlike RFC-822, it has a companion standard, X.500, similar to the telephone directory white pages. The X.500 standard allows distributed directory services, which means that knowing only a name, one could look up the electronic mail address. Elm must eventually evolve beyond its purely UNIX mail roots and handle X.400 messaging systems directly, instead of behind an RFC-822-to-X.400 gateway.

The shift in the UNIX market from character-based terminals to bit-mapped terminals running Graphical User Interfaces (GUIs) also has implications for Elm's development. Both major GUI standards, OpenLook and OSF/Motif, use the X Window System. Future versions of Elm will have to support these as well as the traditional character-based interfaces. A complete redesign of Elm's user interface -- to replace menus with buttons and add support for sliders and multiple windows -- will be required.

These and other changes will wait for a rewrite after 2.4 is released. Like all programs that have evolved through a long development, Elm at some point will need to be rewritten totally to clean up convoluted code and remove some of the past assumptions. Such a rewrite provides the best opportunity to consider the portability issues that created problems in the past and to design in ways of handling them.

About the Author

Sydney S. Weinstein, CDP, CCP is a consultant, columnist, lecturer, author, professor, and president of Datacomp Systems, Inc., a consulting and contract programming firm specializing in databases, data presentation and windowing, transaction processing, networking, testing and test suites, and device management for UNIX and MS-DOS. He can be contacted care of Datacomp Systems, Inc., 3837 Byron Road, Huntingdon Valley, PA 19006-2320 or via electronic mail on the Internet/Usenet mailbox syd@DSI.COM (dsinc!syd for those who cannot do Internet addressing).