
Centaur: Hybrid Desktop with Manageability

Yuval Lirov, Dave Macolino, Martha Ben-Michael, and Andrew Rieger

Hybrid systems, desktops that combine UNIX and Windows NT, offer better systems performance and significant flexibility but usually cost more because of increased systems management complexity. However, gaining the advantages of multiple platforms without increasing costs is possible using uniform systems management. Centaur, a combined UNIX/NT workstation, uniformly administered under Atlas, is a working prototype for this new approach. It is currently deployed in production to 105 users spanning trading, sales, research, development, and support at a Wall Street firm. The system approach and the Atlas management techniques are presented here as potential guides to others implementing similar architectures.

Hybrid systems enable coexistence of applications on different platforms prior to their migration to the new platform, but they also add complexity to system support. Increased complexity usually results in lower systems reliability and increased support costs. Specifically, the SA productivity ratio can drop from 200 UNIX hosts per SA to 85-100 hybrid desktops per SA, raising systems support costs by a factor of at least 2. Moreover, as new desktops are brought to the users without taking away the old desktops, the rising number of networked computers creates network congestion problems in terms of both connection ports and network traffic. Finally, hybrid systems expose multiple new problems at the application level, including keyboard mapping problems, divergent behavior under different X servers, different fonts, and centralized file system service synchronization (Figure 1).

Surprisingly, a new systematic approach to hybrid systems support helps achieve application transparency while maintaining traditional UNIX support productivity levels. It is based on X-hosting technology, and it provides the best of the two different systems (namely, UNIX and Windows NT) by integrating them along three dimensions: time, data, and function. Specifically, a hybrid system allows simultaneous use of shared file and print services, the ability to run both MS and X Windows applications, font sharing, common user names and passwords, and system monitoring. Centaur is an example of a hybrid system that achieves better computing services at lower costs.

Systems Management

In a hybrid environment, uniform systems management is of paramount importance because of the potential economies of scale. Unfortunately, to our knowledge, none of the tools available on the market today provides cross-platform, cross-functional, and ubiquitous functionality. Atlas, a homegrown prototype of such a tool, implements most of the required aspects.

Specifically, Atlas is a global systems management platform for 3,200 hosts, 360 data servers, 10,200 batches, 2,600 backups, and more than 100 NT hosts in New York, Tokyo, and (for global applications) London. Atlas provides access to systems information via the Intranet web. The information spans the spectrum between real-time production crisis management, systems configuration, change management, and support service level accountability. For instance, in May 1997, we experienced 35 production systems outages. In 80% of the cases, Atlas notified us first, allowing proactive troubleshooting ahead of users noticing the problem. Automated crisis identification and notification are key to achieving and maintaining the productivity of 167 hosts/SA despite the recent 25% growth in systems administration workload.

Architecture - NT Desktops and UNIX Back End Servers

After considering several alternative architectures, we settled on combining Microsoft NT workstation and UNIX back-end servers with remote X-hosting technology on a native NT window manager. Centaur, our internal name for the hybrid desktop, provides an interim solution that is both cost-effective and user friendly. Only one interface is needed to access both platforms' applications and only one workstation is required on the desktop.

Figure 2 depicts the scheme for the Centaur configuration. This prototype uses the following five components:

NT Workstation - The user's desktop for running both NT and UNIX-native applications.
UNIX Server - These servers provide file services and run UNIX-native applications for many NT workstations. A typical ratio is five NT workstations to one SPARCstation 5 Model 170. Faster servers support higher ratios.
HummingBird Corp's eXceed - PC X Server providing remote display capability for UNIX/X Window applications.
Samba - Samba is a robust implementation of SMB services on UNIX. It provides file and print sharing to native NT clients.
Load Balancing Name Server - lbnamed tracks load on UNIX servers and dynamically updates the name server authoritative for the domain on which the NT stations are placed with an alias for the best performing host. This allows the UNIX applications to be served in parallel from multiple workstations, ensuring balanced application load. This functionality is transparent to the user.

While the commercial elements of the system and Samba are well-known to most system administrators, the load balancing name server, lbnamed, is not - and warrants a more detailed examination.

lbnamed - The Load Balancing Name Server

lbnamed is a freely distributed load balancing name server written in Perl. The distribution can be obtained from:

http://www-leland.stanford.edu/~schemers/dist/lb.tar

lbnamed provides a means of transparently distributing the compute cost of multiple users running an X application across a number of UNIX servers. lbnamed can be used as a low-cost HA (high availability) solution to clustering UNIX servers or workstations.

A host designated as the load balancing name server tracks the load on a defined group of machines. When a user invokes an X application, a request is sent to the load balancing name server. The name server resolves this request by returning the address of the machine that will yield the best performance, and the application is invoked there.

The load balancing name server consists of the following components:

The poller - The poller daemon periodically polls each of its clients regarding system load and number of users. Based upon a configurable formula, it then calculates a weighted load for each client host and dumps this data into a file.

lbcd - lbcd is the client daemon that runs on each client that is to be polled. lbcd responds to poller requests over UDP using a simple protocol (for more information, see the documentation included with the distribution).

The lbnamed daemon - Externally, the lbnamed daemon looks like a standard DNS name server, listening on named's port and providing resolutions to IP addresses requested. Internally, it maintains a hash table containing the data collated by the poller. Requests are resolved with the IP address of the least loaded system polled.
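The interaction among these three components can be sketched in a few lines of Python. This is an illustrative model only, not code from the lbnamed distribution (which is written in Perl): the function names, the table layout, and the IP addresses are assumptions made for the example.

```python
# Sketch of lbnamed-style resolution: the poller's data is grouped by alias,
# and each request is answered with the least-loaded host for that alias.
# Table layout, function names, and addresses are hypothetical.

def build_table(poll_results):
    """Group polled hosts by alias: {alias: [(weight, host, ip), ...]}."""
    table = {}
    for host, info in poll_results.items():
        for alias in info["aliases"]:
            table.setdefault(alias, []).append((info["weight"], host, info["ip"]))
    return table

def resolve(table, alias):
    """Return the IP of the lowest-weight host for this alias, or None."""
    candidates = table.get(alias)
    if not candidates:
        return None
    _weight, _host, ip = min(candidates)
    return ip

# Data as the poller might have collected it from lbcd on each client.
poll_results = {
    "s5_1": {"ip": "192.0.2.11", "weight": 340, "aliases": ["sunos"]},
    "s5_2": {"ip": "192.0.2.12", "weight": 120, "aliases": ["sunos"]},
    "u2_1": {"ip": "192.0.2.21", "weight": 75, "aliases": ["solaris"]},
}
table = build_table(poll_results)
print(resolve(table, "sunos"))
```

Here a request for the "sunos" alias is answered with the address of s5_2, the host with the lower weight of the two candidates.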

At virtually no cost, the use of lbnamed provides the Centaur model with the following characteristics:

Scalability - By simply adding new server names to lbnamed's configuration file, new UNIX servers can be integrated into lbnamed's pool of X application servers. In this manner, the inevitable increase in number of users or application compute complexity can be accommodated with minimal administrative cost.

Robustness - Because each X application is replicated across a number of machines in the server pool, the loss of any given machine will not impact the availability of the application. If a server is unavailable for any reason, the lbnamed poller will see that it has failed to respond and remove it from the list of available servers. When the server comes back up, it will be recognized by the poller and reinserted into the server pool.

Efficiency - All too often, misused technology is mistaken for insufficient technology. lbnamed transparently maximizes the use of all available system resources by eliminating the need for users to be educated regarding each system's existence or capabilities. As faster systems are added to the server pool, they will naturally absorb greater proportions of the workload as their system loads indicate their capacity to do so.
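The robustness behavior described above, in which an unresponsive server drops out of the pool and later rejoins it, can be modeled as follows. This is a minimal sketch under stated assumptions: `refresh_pool` and the `poll` callable are hypothetical names, and a real poller would query lbcd over UDP rather than read from a dictionary.

```python
# Sketch: rebuild the pool of available servers on every polling round.
# A host that fails to answer (poll returns None) is simply left out;
# it reappears automatically in the first round in which it answers again.

def refresh_pool(configured_hosts, poll):
    """Return {host: weight} for the hosts that answered this round."""
    pool = {}
    for host in configured_hosts:
        weight = poll(host)
        if weight is not None:
            pool[host] = weight
    return pool

hosts = ["s5_1", "s5_2", "s5_3"]

# Simulated polling rounds: s5_2 is down in round 1 and back in round 2.
round1 = {"s5_1": 210, "s5_2": None, "s5_3": 95}
round2 = {"s5_1": 210, "s5_2": 40, "s5_3": 95}

pool1 = refresh_pool(hosts, round1.get)
pool2 = refresh_pool(hosts, round2.get)

print(sorted(pool1))
print(min(pool2, key=pool2.get))
```

In the first round only s5_1 and s5_3 are eligible; in the second, s5_2 has returned and, being the least loaded, is the host that would be handed out.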

lbnamed - Configuration Examples

In most cases, lbnamed will be used to create a virtual machine. A cluster of workstations can be configured to operate under one virtual hostname. The host that is the least loaded will be returned. For example, assume we have an inventory of:

5 Sparc 5 170s w/ SunOS 4.1.3_U1  hostnames: s5_1 through s5_5
5 Sun Ultra 2200 w/ Solaris 2.6   hostnames: u2_1 through u2_5

In order for lbnamed to delegate workload across multiple machines in a domain, it must be authoritative for that domain. In this example, we'll use the domain name lbnamed.centaur.org (see Figure 3). The primary nameserver for the organization, ns.centaur.org, will refer any name service requests for that domain to the nameserver that is authoritative for the lbnamed domain; we'll use nslb.centaur.org in our example. The nameserver, ns.centaur.org, will have the following entry in its db.centaur file used by the normal implementation of BIND:

lbnamed    IN    NS    nslb.centaur.org

A name resolution request that is sent to ns.centaur.org for the host sunos.lbnamed.centaur.org will be forwarded to nslb.centaur.org, since it is authoritative for the lbnamed.centaur.org domain. This server in turn will resolve the hostname in the SunOS grouping, sunos.lbnamed.centaur.org, to be one of the five Sparc 5s. The one that is selected will have the least load on it as determined by the poller. The poller uses formulas based on users logged into a system in conjunction with load average to determine the given load of a host. The formulas used to determine the system load are:

fudge = (tot_user - uniq_user) * 20;
weight = (uniq_user) * 100 + (3*l1) + fudge;

where:

tot_user = the number of users logged in
uniq_user = unique number of users logged in

l1 = the load average over the last minute multiplied by 100
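As a worked example, the two formulas can be evaluated directly. The Python function below is a transcription of the formulas above; the function name and the sample figures are assumptions chosen for illustration, not measurements from the Centaur deployment.

```python
# Sketch: the poller's weighting formulas, transcribed into Python.

def host_weight(tot_user, uniq_user, load_avg_1min):
    """Return the weight of a host; a lower weight means a better candidate."""
    l1 = round(load_avg_1min * 100)        # one-minute load average times 100
    fudge = (tot_user - uniq_user) * 20    # penalty for repeated logins
    return uniq_user * 100 + 3 * l1 + fudge

# Host A: 6 logins from 4 distinct users, one-minute load average 0.50
weight_a = host_weight(6, 4, 0.50)   # 400 + 150 + 40
# Host B: 3 logins from 3 distinct users, one-minute load average 0.25
weight_b = host_weight(3, 3, 0.25)   # 300 + 75 + 0

print(weight_a, weight_b)
```

Host A scores 590 and host B scores 375, so host B would be selected. Note how heavily the formula weighs users: each unique user counts as much as a load average increase of about 0.33.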

The Sparc 5 with the lowest weight, as determined by the poller, is configured dynamically in the lbnamed server as the alias sunos.lbnamed.centaur.org.

Configuring lbnamed is done primarily in the poller configuration file. Included in the lbnamed distribution is an example file, sweet.config. The syntax of the file is:

hostname    weight-multiplier    alias(es)

In our example, we have 10 different machines on which to balance our application load (see Listing 1). This configuration will create two virtual machines sunos.lbnamed.centaur.org and solaris.lbnamed.centaur.org.
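To make the three-column syntax concrete, the following Python sketch parses lines in that form and groups hosts by alias. It is not part of the lbnamed distribution, and several details are assumptions: that fields are whitespace-separated, that multiple aliases are listed as additional fields, and that "#" begins a comment. The sample entries are hypothetical and do not reproduce Listing 1.

```python
# Sketch: parse "hostname weight-multiplier alias(es)" lines and derive
# the virtual machines (aliases) they define.

def parse_poller_config(text):
    """Return a list of (hostname, multiplier, [aliases]) tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected hostname, multiplier, alias(es): {raw!r}")
        entries.append((fields[0], float(fields[1]), fields[2:]))
    return entries

def virtual_hosts(entries):
    """Group hostnames by alias: {alias: [hostname, ...]}."""
    groups = {}
    for host, _multiplier, aliases in entries:
        for alias in aliases:
            groups.setdefault(alias, []).append(host)
    return groups

sample = """\
# hypothetical entries for illustration
s5_1  1  sunos
s5_2  1  sunos
u2_1  1  solaris
u2_2  1  solaris
"""

groups = virtual_hosts(parse_poller_config(sample))
print(groups)
```

Each distinct alias becomes one virtual machine, so this sample yields the two groups "sunos" and "solaris".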

Flow of Application Execution

The NT Workstation running eXceed can run applications native to NT locally, or run UNIX/X Windows applications remotely from SPARCstations or servers. Both platforms' applications will run, transparently to the user, from either NT or UNIX environments. File and print sharing between platforms also functions transparently.

The user of the Centaur desktop invokes either platform's applications through the NT Start Menu or desktop Icon. NT native applications run locally on the NT workstation. When a user activates a UNIX application, the NT operating system distinguishes that the program being executed is to be handled by the eXceed X Server. The eXceed "start file" passes the necessary runtime parameters to eXceed. These parameters include the host on which to remotely execute programs, the program to run with command line arguments, and the user ID under which the program is run. Next, the remote execution request is made by eXceed to a workstation contained in the "virtual cluster". Finally, the application is executed on the SPARCstation while being displayed to the NT Workstation making the request. The Perl wrapper script "ntXstart" sets the remote display (see Listing 2). This allows the user to connect from any NT workstation without having to manually set the location for remote X hosting.
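The idea behind the wrapper, pointing the remotely executed program back at the X server on the requesting NT workstation, can be sketched as follows. This is not the ntXstart script of Listing 2 (which is written in Perl); the function name and the workstation name are hypothetical, and how the wrapper learns the client's hostname is left out of the sketch.

```python
# Sketch: build the environment for an X application so that it displays
# on the workstation that requested it.
import os

def remote_display_env(client_host, display=0, screen=0, base_env=None):
    """Return a copy of the environment with DISPLAY set to the client."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = f"{client_host}:{display}.{screen}"
    return env

# "nt-desk-17" is a hypothetical NT workstation running the X server.
env = remote_display_env("nt-desk-17", base_env={"PATH": "/usr/bin"})
print(env["DISPLAY"])
```

The resulting environment could then be passed to the application at launch, for example through the `env` argument of `subprocess.Popen`.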

Atlas - A Uniform Systems Management Platform

Atlas is a multi-platform, uniform systems management suite. It expedites navigation through the typical maze of enterprise-wide system components and provides both crisis management and comprehensive performance history for any UNIX or NT host, database, or batch process. Atlas uniquely integrates the time, function, and structural aspects of the support process; makes outages, systems' shortcomings, and support resources visible to everybody; and pulls the resources together to fix the problems.

Atlas delivers critical information while maintaining a uniform view above a forest of details and multiple platforms. It provides an intuitive view of every aspect of multi-platform systems, yet it selectively presents this information only at the request and pace of each individual who uses it. On the highest level, it provides a view of business lines and their associated hosts, data servers, batches, backups, and any other categories used to define systems components (Figure 4).

Architecturally, platform-dependent monitors periodically collect the status of defined systems objects. Descriptions of the detected faults are passed on to a platform-independent fault management system, which, in turn, diagnoses the problem using the configuration information available about the object and the severity and frequency of the fault. It then determines the appropriate notification method and recipients, and passes the diagnostics and the notification information on to the notification and problem tracking systems. Real-time and historical data on detected faults are displayed on both UNIX and NT platforms through a platform-independent GUI written in Java.
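The flow from fault report to notification decision can be illustrated with a small model. This is a sketch of the general pattern only, not Atlas code: the class, the severity names, the threshold of three occurrences, and the notification methods are all assumptions invented for the example.

```python
# Sketch: a platform-independent fault manager that picks a notification
# method from fault severity and frequency, and a recipient from the
# configuration stored for the affected object.
from collections import Counter

def choose_notification(severity, occurrences):
    """Escalate on critical faults or on repeated faults (hypothetical rule)."""
    if severity == "critical" or occurrences >= 3:
        return "page"
    if severity == "warning":
        return "email"
    return "log"

class FaultManager:
    def __init__(self, config):
        self.config = config        # {object_name: {"owner": recipient}}
        self.counts = Counter()     # faults seen so far, per object
        self.tickets = []           # stand-in for the problem tracking system

    def report(self, obj, severity, description):
        """Record a fault from any platform's monitor and return its ticket."""
        self.counts[obj] += 1
        ticket = {
            "object": obj,
            "severity": severity,
            "description": description,
            "notify": choose_notification(severity, self.counts[obj]),
            "recipient": self.config.get(obj, {}).get("owner", "support"),
        }
        self.tickets.append(ticket)
        return ticket

fm = FaultManager({"db01": {"owner": "dba-team"}})
first = fm.report("db01", "warning", "slow response")
fm.report("db01", "warning", "slow response")
third = fm.report("db01", "warning", "slow response")
print(first["notify"], third["notify"], third["recipient"])
```

Because the monitors only hand over a description of the fault, the same decision logic serves UNIX and NT objects alike.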

Atlas (specifically its problem tracking system) can be a powerful tool for ensuring that the support sector remains accountable to the users running and developing applications on both UNIX and NT platforms. In addition to the automatic creation of problem tickets when a fault is detected, users can submit requests through an intuitive GUI. Atlas also provides reports allowing both support managers and users to stay on top of the work flow. For the systems manager, pending reports show what work is outstanding for his or her group. For the systems user, support meeting agenda reports show all recently completed work and all outstanding requests. For all interested parties, there are daily backlog statistics, calling out priority items as a special case (Figure 5).

Atlas uses the Microsoft Performance Monitor, which is part of the NT OS, to monitor the NT hosts. The monitor is configured to raise alarms when certain error conditions are detected. When an alarm is raised, a message that includes the description of the problem and its source is sent to the UNIX fault management server via TCP/IP. The fault management server diagnoses the problem, issues notifications, and displays the current status and historical error information on the platform-independent Atlas Java GUI. NT host configuration information, collected via remote polling by Windows NT Diagnostics tool, is stored in a centralized database and is displayed by the Java GUI.
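The alarm path from an NT host to the UNIX fault management server might be modeled as below. The wire format shown (one JSON object per line) and all names are assumptions for illustration; the actual message format used between Performance Monitor and Atlas is not given here.

```python
# Sketch: encode an alarm carrying its source and a problem description,
# and send it to a fault management server over TCP.
import json
import socket

def encode_alarm(source, description):
    """Serialize an alarm as one newline-terminated JSON line (hypothetical)."""
    message = {"source": source, "description": description}
    return (json.dumps(message) + "\n").encode("utf-8")

def decode_alarm(data):
    """Inverse of encode_alarm, as the receiving server might apply it."""
    return json.loads(data.decode("utf-8"))

def send_alarm(server_host, server_port, source, description):
    """Open a TCP connection to the fault management server and send one alarm."""
    with socket.create_connection((server_host, server_port), timeout=5) as conn:
        conn.sendall(encode_alarm(source, description))

# Round trip without a network: what is sent is what the server decodes.
wire = encode_alarm("nt-desk-17", "free disk space below threshold")
alarm = decode_alarm(wire)
print(alarm["source"], "-", alarm["description"])
```

`send_alarm` is defined but not called in the sketch, since the server address depends on the installation.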

Open Issues

Although much has been accomplished in terms of cross-platform integration and systems management, a number of open issues must still be addressed before Centaur can be considered a production-quality environment. These issues fall into the three broad categories discussed below.

Desktop Integration

Although Centaur begins to address many of the issues concerning desktop integration, there is still much work to be done. File sharing between UNIX and NT systems is still an open issue. One method is to run one of the third-party NFS daemons on the NT box and share files via NFS. Although NFS is a well-established industry standard in the UNIX world, its performance and robustness are still in question on the NT side. An alternate method is to use NT's SMB protocol and run Samba, a freely available SMB daemon for UNIX. Both methods require testing for robustness and performance.

A second open issue is the choice between XDM and RSH for authentication and application launching. XDM is more flexible in terms of management and authentication, but it uses a remote window manager and takes control of the entire NT desktop. The remote window manager is significantly slower than a local NT window manager.

Third, scores of applications must be tested to ensure that all functionality, aesthetics, and behavior are preserved on the eXceed X Server. Most likely, special fonts, keyboard mappings, and user environments will need to be implemented on NT in order to provide seamless execution.

Finally, there is the issue of the load balancing name daemon. While lbnamed is theoretically a big win in terms of robustness and performance, it is still in the beta testing stage and must be thoroughly stress tested before being used in a production environment.

Environmental Impact

Until recently, the majority of network infrastructure served one specific platform, either PC or UNIX-based systems. As the need grows for improved technology, networks are beginning to host a combination of these platforms. As more systems populate a network, its efficiency and availability decrease while its use increases proportionally. Can Centaur aid in reducing network traffic while freeing network resources such as ports? Further testing and network traffic analyses need to be performed.

In recent years, disaster recovery has become a growing concern for firms in the financial industry. The focus of any disaster recovery project is to have an off-site location that will allow the business to function in the event of a disaster that would render the primary site useless. Needless to say, vast resources and expenses accompany this type of undertaking. Currently in an environment where UNIX and NT co-exist, both would be critical for conducting business. Does Centaur provide enough flexibility to provide a service to both UNIX and NT applications in the event of a disaster?

Systems Management

A centralized cross-platform systems management scheme needs to be developed. Currently, monitoring is supported, but user management for both UNIX and NT must be developed and tested. Aspects that need to be covered are: UNIX and NT menu management, cross-platform username resolution, email management, and application distribution.

Summary

Until recently, the primary Wall Street desktop computing environment has been the UNIX workstation. These workstations have provided a flexible and robust environment in which to run both in-house and off-the-shelf applications. Unfortunately, these workstations lack the ability to run common productivity tools typical of the PC environment.

In the past two years, Microsoft Windows NT has gained popularity as a workstation environment. The NT workstation combines the user friendly, GUI environment of Windows with a more robust, secure workstation platform. The NT workstation allows users to run PC applications in a more professional environment. Additionally, because NT runs on the PC platform (among others), it is an attractive choice from a corporate cost perspective. There are numerous debates as to whether NT is as good as UNIX for the financial environment. People contest its manageability, its robustness, and the true cost of ownership. In our opinion, NT will be a key desktop environment for the next generation of workstations, and UNIX servers will continue to provide a flexible, robust, and easily managed server environment. Combining NT desktops and UNIX servers into a single architecture, which we have dubbed Centaur, provides the best of both worlds.

A methodology for remote support of the combined desktop needs to be built and tested in a production environment. Among other issues, it must deal with special keyboard mappings and different behavior under different X servers such as xnews, X, or eXceed. Also, different fonts used by UNIX applications need to be identified and configured on a font server or integrated within the standard NT blast. Finally, file sharing issues, including AFS and DFS synchronization, must be addressed.

References

Y. Lirov, "Mission Critical Systems Management," Prentice Hall, 1997.

Y. Lirov, M. Ben-Michael, P. Brin, M. Covic, A. Rieger, A. Sherman, and T. Wagersreiter, "Web-based Distributed Systems Management with Atlas," forthcoming in Mathematical and Computer Modelling, 1998.

Y. Lirov, M. Ben-Michael, L. Chen, A. Rieger, A. Sherman, and T. Wagersreiter. "Atlas: A Universal Platform for Distributed Systems Management." Heuristics - The Journal of Intelligent Technologies. Spring 1997, Volume 10, Number 1, pp. 38-55.

Roland J. Schemers, III. "lbnamed: A Load Balancing Name Server in Perl." LISA IX, Fall 1995.

About the Authors

Yuval Lirov is Senior Vice President at Lehman Brothers, in charge of the production of Portfolio Management Systems. He is an author of "Mission Critical Systems Management" (Prentice Hall, 1997) and over 100 technical publications and patents in distributed systems management and resource allocation.

Dave Macolino is Vice President at Lehman Brothers leading the Internet and Intranet initiatives for the firm. He has over 10 years of experience in the UNIX and PC platforms and has done extensive work on integrating the two platforms.

Martha Ben-Michael is Vice President in the Fixed Income Analytics department at Lehman Brothers. Previously she led the development of systems management projects in the UNIX support area at Lehman Brothers and at the Israeli Ministry of Defense.

Andrew Rieger is Vice President at Lehman Brothers, in charge of UNIX Support, managing 24x7 support of over 3,300 hosts and 360 dataservers.