Software-based Cluster Computing

Victor Hazlewood

Do you still believe batch software is just for mainframes and supercomputers? Not any more! The Network Queueing System (NQS) has its origins on mainframes and supercomputers, but the capabilities of today's batch systems go far beyond those of their ancestor. True batch environments (not just at or cron) now provide tools for system administrators to manage an enterprise-wide network of high-performance workstations and servers.

In commercial and academic environments, high-performance desktop workstations and servers have become increasingly commonplace in workgroups, classrooms, and laboratories. In these environments, Unix system administrators are expected to satisfy the diverse needs of department and enterprise users by squeezing the most out of available computer resources. One viable option is to provide a batch and parallel processing environment with load balancing capabilities on these systems. This type of environment is commonly referred to as a workstation cluster.

A cluster can be defined as a collection of computers on a network that can function as a single computing resource [1]. Providing a cluster requires software to effectively control job and system resources, to load balance across the network, and to exploit idle cpu cycles on the desktop while controlling interference with interactive use of these systems. Yesterday's single-processor batch environments have evolved into today's distributed computing environments.

This article, based on a project completed in 1995, explores the features available for a software-based batch environment on a cluster, compares several of the most popular commercial and public domain batch environments, and provides an overview of the installation and configuration of one of the commercial products on a small cluster of workstations. Note that hardware-centric clustered systems, such as those available from Data General, Digital Equipment Corp, and others, were outside the scope of this project.

Beyond NQS

Commercial and public domain batch queueing environments provide network load balancing, configurable job scheduling, job resource control, fault tolerance, network resource monitoring, and, in some cases, a Web user interface. These software systems have the potential to deliver the untapped resources of high-performance workstations and servers that have previously gone to waste.

Many workstation vendors are beginning to include clustering software with their operating system or as optional, separately released products. Table 1 shows the most popular queueing software available on Unix workstations along with many of the features they offer. Additionally, Table 1 shows which products the workstation vendors support, and clearly there is no single clustering software winner. Again, it is up to Unix system administrators to keep informed of the options and determine the solution that best suits their particular needs and environment.

The following is a list of the most popular commercial and public domain queueing and clustering products available today:

  • CODINE from GENIAS Software GmbH,

  • Condor from University of Wisconsin,

  • DQS from Florida State University,

  • Generic NQS from University of Sheffield, UK,

  • LoadLeveler from IBM,

  • LSF from Platform Computing Corporation,

  • NQE from SGI/Cray, and

  • PBS from NASA/Ames.

    This is not an exhaustive list, but contains arguably the most popular public domain and commercial distributed queueing and clustering systems. Contact information for these systems is listed in Table 2.

    Feature Comparison

    In comparing these software systems, Table 1 shows the availability of key features most often listed in product reviews, technical papers, marketing materials, and product documentation. Table 1 and the feature descriptions have been adapted and updated from the report by Kaplan and Nelson [1].

    The following subsections briefly describe each of the features compared in Table 1. The features listed provide capabilities of interest in large department and enterprise-wide cluster environments. Each feature provides a unique capability on its own; in combination, they provide a powerful environment for resource use and control across the cluster.

    Heterogeneous Support

    A heterogeneous cluster consists of some number of computers of different architectures running different operating systems. To be useful in today's open computing environments, a cluster system must support many different architectures and operating systems.

    Batch Queueing Environment

    A batch job is defined as a shell script containing commands that can be executed without the intervention of the user. Batch queueing environments provide the software tools necessary to allow users to submit, monitor, and maintain batch jobs. For most Unix systems, NQS is considered to be the standard batch queueing interface. All of the systems listed in Table 1 provide NQS or an NQS-like capability.
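
    In practice, this means the familiar submit, monitor, and delete cycle. The short sketch below uses the NQS command names, which the listed systems either provide or closely mimic; the request-id is a placeholder for the identifier that qsub reports.

    # Submit a shell script as a batch request, check on it, and,
    # if necessary, remove it (qsub reports the request-id):
    qsub myjob
    qstat
    qdel <request-id>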

    Parallel Programming Support

    Many users need a collection of heterogeneous computers that can be used as a coherent and flexible concurrent computational resource. Software libraries and subsystems, such as Parallel Virtual Machine (PVM) and Message Passing Interface (MPI), are used to exploit parallelism across tightly coupled and loosely coupled computer systems. Some clustering systems currently include support for parallel jobs while others do not. Users are finding parallel support in these environments useful, and many vendors are working on providing these capabilities.

    Checkpointing

    Checkpointing is the ability to create a restart file containing all of the necessary state information to restart a batch job. Checkpointing is a sophisticated feature that is typically made available in one of two ways:

    1. by linking the application with a specific set of libraries that provide the checkpointing capability, or

    2. by running the batch job on a system whose operating system provides a checkpoint/restart capability.

    Currently, UNICOS is the only operating system known to provide checkpoint/restart capability. UNICOS provides this with system calls chkpnt(2) and restart(2). SGI will be providing similar system calls in IRIX 6.4 for 64-bit systems.

    Process Migration

    Process migration is the ability to move a job from one computer system to another without restarting the job from the beginning. Support for this mechanism usually comes either from special linking requirements for the application or from migration across homogeneous systems whose operating system supports the feature. Process migration across a heterogeneous environment is extremely difficult and requires an enormous amount of information about the process to be successful. Heterogeneous process migration is not available at this time.

    Configurable Scheduler

    A new feature that has emerged in several of the products during the past 12 to 18 months is the configurable scheduler. This feature allows system administrators to implement site-dependent political and load-dependent scheduling algorithms for initiation of batch jobs throughout the cluster. Current implementations provide ways to schedule jobs through Perl, Tcl, and C programs.

    Load Balancing

    Load balancing is the process of spreading work (batch jobs) throughout the cluster so that no single host is saturated while others sit idle. Most cluster systems implement load balancing by collecting system activity (load) information on each computer in the cluster and passing this information to a master load balancing server. The master server uses the information to place batch jobs on hosts in the cluster. Many of the load balancing mechanisms allow system administrators to configure site-dependent system activity objects and load balancing policies.
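
    As a conceptual illustration only (this is a sketch of the idea, not the mechanism any particular product uses), the collection side could be as simple as each execution host periodically reporting its one-minute load average to the master:

    # Conceptual sketch of load collection -- the hostname "master" and
    # the file location are made up for illustration.
    while true
    do
        load=`uptime | sed 's/.*load average: //' | cut -d, -f1`
        rsh master "echo `hostname` $load >> /var/tmp/cluster_loads"
        sleep 60
    done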

    Run Time Limits

    Allowing system administrators to set run time limits on cpu time, memory, disk, and other resources and enforcing those limits on batch jobs prevents the abuse of computational resources by the users. These limits are usually set and enforced within the batch queueing environment.
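
    For example, NQS-derived systems let users request limits at submission time, which administrators can cap per queue. A hedged sketch follows; the option letters are NQS-style qsub options, and the memory value format is an assumption that varies between implementations:

    # Request a per-request CPU time limit and memory limit at
    # submission time (NQS-style options; other products differ):
    qsub -lT 3600 -lM 32mb myjob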

    Graphical User Interface/Command Line Interface (GUI/CLI)

    Each of the cluster systems in Table 1 supports a command line interface (CLI) into its environment. Providing CLI mechanisms for submitting, monitoring, and maintaining batch jobs throughout the cluster is a basic feature of these systems. Providing a graphical user interface (GUI) can significantly improve the productivity of cluster users (especially novice users) and is fast becoming a required feature of the popular systems.

    Public Domain

    Public domain cluster systems provide essential features for those sites that require the ability to make changes to the software to implement site-specific requirements or site dependencies. All of the systems that are in the public domain provide source code at little or no cost.

    Web Interface

    A rapidly emerging feature, driven by the power and influence of the World Wide Web, allows users to submit, monitor, and maintain batch jobs through Web browsers. A Web interface into the cluster environment will continue to grow in popularity and may replace the X Window-based GUI in the near future.

    Selecting a Product for Your Site

    Is providing a batch or cluster environment necessary at your site? Questions to ask include:

    1. Are desktop cpu cycles going to waste each night while the one user of each system is at home, on vacation, or traveling?

    2. How many of the desktop systems are high performance systems?

    3. Are your computational servers loaded, and are users requesting more cpu cycles from you while you have little or no hardware budget to satisfy their requests?

    4. Can your loaded servers benefit from more effective and efficient use of resources by limiting interactive use and providing a batch environment?

    If you answered yes to any one of the above questions, then you may want to investigate one or more of the products listed in Table 1. Table 2 provides contact information.

    If you are not sure what your site requirements are, you might want to evaluate one or more of the systems for a short period. This will help provide insight into which features you could use at your site. The public domain versions are available free for evaluation, and most commercial products are available for evaluation with license files that expire within 1-3 months.

    Installation Tutorial

    While investigating the batch systems in Table 1, I performed an evaluation of the NQE product available from Cray Research. Because I was familiar with the batch environment available on UNICOS systems (Cray's version of Unix), I wanted to investigate the use of this system on workstations. I obtained an evaluation copy of NQE from Cray Research to install and configure. The following is a tutorial of the installation of NQE 3.1 on a cluster of eight UltraSPARCs running Solaris 2.5.

    Installation Goals

    The goal of the installation of NQE 3.1 was to build a load-balanced cluster across eight UltraSPARCs. I will refer to these systems as the Ultra Farm or ufarm nodes. Each of the ufarm nodes has a unique hostname and an alias. The hostnames are solomon, xena, zero, groucho, smurf, bozo, galt, and anime. The aliases are ufarm1, ufarm2, ufarm3, ufarm4, ufarm5, ufarm6, ufarm7, and ufarm8, respectively. Another goal was the configuration of ufarm1 as the master server and ufarm2-ufarm8 as execution servers. The master server is a special type of server that collects status information from each host and acts as the network load balancer (NLB).
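
    The aliases are simply additional names for the same hosts. If they are not already carried in the site's NIS or DNS maps, /etc/hosts entries along the following lines will do (the addresses shown are placeholders):

    # /etc/hosts entries for the Ultra Farm (placeholder addresses)
    192.168.1.1   solomon   ufarm1
    192.168.1.2   xena      ufarm2
    ...
    192.168.1.8   anime     ufarm8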

    Getting Started

    The NQE 3.1 software and documentation were obtained. NQE 3.1 came on a CD-ROM containing software for all RISC systems, a CD-ROM with online documentation in Electronic Book format, an installation manual (RO-5237), an NQE Administration manual (SG-2150), and a license file. Supported on the CD-ROM were IRIX 6.2, SunOS 4.1.3, Solaris 2.4, AIX 4.2, OSF/1 V3.0, and HP-UX 9.0. The ufarm nodes are all Solaris 2.5 systems, and even though NQE 3.1 was only tested on Solaris 2.4 systems, it was installed and run successfully on these systems.

    Installing from CD-ROM

    I started the installation on ufarm1 (solomon). The installation required mounting the CD-ROM and running the installation program /cdrom/cdrom0/INSTALL as root. A confirmation message appeared on the screen (see Listing 1), and then an NQE configuration X window appeared (see Figure 1). The NQE configuration window gives the administrator configuration options for the root directory of the installation (NQE_ROOT), the type of NQE installation to perform (NQE_TYPE), which can be Master Server, Execution Server, or Client, and the spool directory (NQE_SPOOL). These selections should be changed to match the appropriate configuration for each installation. I changed NQE_ROOT to /usr/local/apps, NQE_SPOOL to /var/nqe/spool, NQE_TYPE to MASTER_SERVER, and left all the other values at their defaults. Because /usr/local/apps was NFS mounted on all ufarm nodes, this single installation also placed the software on ufarm2 through ufarm8. Listing 1 shows the output from the installation command.
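
    For reference, with the Solaris volume manager (vold) running, the commands on ufarm1 amounted to little more than the following sketch; the CD-ROM device name shown for the non-vold case is an assumption and will vary:

    # With vold running, the CD appears under /cdrom/cdrom0:
    solomon# /cdrom/cdrom0/INSTALL
    # Without vold, mount the CD by hand first (device name will vary):
    solomon# mount -F hsfs -o ro /dev/dsk/c0t6d0s0 /cdrom
    solomon# /cdrom/INSTALL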

    Directory Structure

    The installation created the following directory structure under /usr/local/apps.

    craysoft/nqe/3.1.0.1/
        bin/
        etc/
        examples/
        include/
        lib/
        man/
        solomon/
        www/

    Note the directory for solomon. This directory contains the very important file nqeinfo, which contains the environment variables necessary for NQE to run successfully. A directory for each ufarm node with the same name as the hostname (zero, xena, etc.) was created by hand, and the nqeinfo file was copied to each of these directories. NQE 3.0 included a command, add_node, that worked well for installing NQE on servers that shared the software installation file system. Unfortunately, this feature has been removed, and the work must now be performed by hand.

    The following changes were made to the nqeinfo file for the other ufarm nodes: the variable NQE_TYPE was changed to EXEC_SERVER, and the variable NQS_SERVER was set to the hostname of each execution server. There is a utility, nqeconfig, that will update the nqeinfo file, but I chose to edit the nqeinfo files by hand. The documentation covers the use of nqeconfig to configure an nqeinfo file in detail.
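
    With add_node gone, a small loop on the master can stand in for it. The following is only a sketch; it assumes the nqeinfo file consists of simple VARIABLE=value lines:

    # Replicate solomon's nqeinfo for each execution server, adjusting
    # NQE_TYPE and NQS_SERVER (assumes VARIABLE=value lines in nqeinfo).
    cd /usr/local/apps/craysoft/nqe/3.1.0.1
    for host in xena zero groucho smurf bozo galt anime
    do
        mkdir $host
        sed -e 's/^NQE_TYPE=.*/NQE_TYPE=EXEC_SERVER/' \
            -e "s/^NQS_SERVER=.*/NQS_SERVER=$host/" \
            solomon/nqeinfo > $host/nqeinfo
    done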

    Finishing the Installation

    Next, the following was added to the /etc/services file on each ufarm node:

    msql 603/tcp        #NQE msql service
    nlb  604/tcp        #NQE nlb service
    fta  605/tcp        #NQE fta services
    nqs  607/tcp        #NQE nqs service

    The license file license.dat was created from the printed license file included with the installation materials and placed in /usr/local/apps/craysoft/nqe/3.1.0.1/etc/license.dat. The license file for evaluation copies is typically valid for one month. The nqemaint utility was run next to configure symbolic links; this is useful when you have multiple versions of NQE installed. In this case it checked for multiple versions and made version 3.1.0.1 the default by creating the following symbolic links:

    /usr/local/apps/craysoft/nqe/bin -> /usr/local/apps/craysoft/nqe/3.1.0.1/bin
    /usr/local/apps/craysoft/nqe/etc -> /usr/local/apps/craysoft/nqe/3.1.0.1/etc
    /usr/local/apps/craysoft/nqe/examples -> /usr/local/apps/craysoft/nqe/3.1.0.1/examples
    /usr/local/apps/craysoft/nqe/include -> /usr/local/apps/craysoft/nqe/3.1.0.1/include
    /usr/local/apps/craysoft/nqe/lib -> /usr/local/apps/craysoft/nqe/3.1.0.1/lib
    /usr/local/apps/craysoft/nqe/man -> /usr/local/apps/craysoft/nqe/3.1.0.1/man
    /usr/local/apps/craysoft/nqe/www -> /usr/local/apps/craysoft/nqe/3.1.0.1/www

    A link was then made from /etc/nqeinfo on each NQE host to its corresponding nqeinfo file. The nqemaint utility can create this link, but I opted to do this by hand. The following symbolic links were created on the ufarm nodes:

    solomon: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/solomon/nqeinfo
    xena: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/xena/nqeinfo
    zero: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/zero/nqeinfo
    groucho: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/groucho/nqeinfo
    smurf: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/smurf/nqeinfo
    bozo: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/bozo/nqeinfo
    galt: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/galt/nqeinfo
    anime: /etc/nqeinfo -> \
    /usr/local/apps/craysoft/nqe/3.1.0.1/anime/nqeinfo

    Starting NQE

    To start NQE, /usr/local/apps/craysoft/nqe/etc/nqeinit was run by root on each system. It is important to start the master server first because, once started, each execution server sends system status information to the master server for use in load balancing. Example output from nqeinit for ufarm6 (bozo) can be found at http://www.sdsc.edu/~victor/Projects/NQE/startup.html.
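
    Because the ordering matters, a small wrapper script run from the master helps; the sketch below assumes root can rsh from ufarm1 to the other ufarm nodes:

    # Start the master first, then the execution servers.
    NQE_ETC=/usr/local/apps/craysoft/nqe/etc
    $NQE_ETC/nqeinit                    # on solomon (master server)
    for host in ufarm2 ufarm3 ufarm4 ufarm5 ufarm6 ufarm7 ufarm8
    do
        rsh $host $NQE_ETC/nqeinit      # execution servers
    done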

    Using NQE

    At this point NQE was running on the Ultra Farm, and I used cload to take a quick look at each of the systems. See the snapshot output of cload in Figure 2 and Figure 3. The batch environment was reviewed with the qstat command. Listing 2 shows what was seen on ufarm1 and ufarm2.

    To submit a job with NQE, the qsub command is used. An example would be:

    solomon% qsub myjob
    nqs-181 qsub: INFO
    Request <request-id>: Submitted to queue <queue> by <user>.

    where myjob is a file containing Unix commands to be run on the destination host. It is essentially a shell script that runs without any user interaction.
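
    For instance, myjob might contain nothing more than the following (a made-up example; the program and file names are hypothetical):

    # myjob -- ordinary Unix commands, run on whichever host NQE selects
    cd $HOME/bench
    make clean all
    ./run_bench > bench.`hostname`.out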

    NQE Configuration

    The default NQE configuration sets up a batch queue called nqebatch and a pipe queue (used for routing) called nqenlb. Out of the box, the default queue is the pipe queue, nqenlb, which sends jobs to the load balancer, ufarm1, for destination selection. The default configuration of the load balancer selects the most idle system to which to route jobs. More complex configurations of batch and pipe queues are possible, and the load balancer can be configured to route jobs based on policies defined by the administrator (a brief queue-selection example follows the list below). Values available on which to base load balancing policies include:

  • physical memory,

  • free memory,

  • number of cpus,

  • free disk space,

  • free swap space,

  • I/O transfer,

  • system cpu,

  • idle cpu, and several others.
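
    As mentioned above, a brief queue-selection sketch: a user can let the load balancer choose the destination host by taking the default pipe queue, or name the local batch queue explicitly with the standard -q option.

    # Default: the request enters nqenlb and the load balancer on
    # ufarm1 chooses the destination host.
    qsub myjob
    # Bypass the balancer and run in the local batch queue instead:
    qsub -q nqebatch myjob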

    More information on NQE configuration is available in the NQE Administration manual SG-2150 available from Cray Research [2].

    Web Interface

    The Web interface with NQE 3.1 is made up of cgi-bin scripts that allow a Web browser to interface with the following NQE features:

  • job submission,

  • job status,

  • job control.

    The cgi-bin scripts are provided on the CD-ROM; in conjunction with an NCSA httpd or Apache Web server, they provide all that is needed to allow job submission, status, and control through a Web browser.

    As currently implemented, the NQE Web interface requires the Web server to run as root. For this reason, I decided to run a Web server for the sole purpose of providing NQE access, which limits the exposure to possible security threats. Security concerns should be taken into consideration if the Web interface is used.
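
    For such a dedicated server, very little configuration is needed. The following is a hedged sketch of the relevant Apache 1.x httpd.conf lines; the port number and the exact location of the cgi-bin scripts under the nqe www directory are assumptions for illustration:

    # Minimal fragment for a Web server dedicated to NQE access
    # (port and script directory are assumptions)
    Port 8080
    ServerName solomon
    ScriptAlias /cgi-bin/ /usr/local/apps/craysoft/nqe/www/cgi-bin/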

    Conclusion

    I found NQE installation to be straightforward and easy to complete successfully. The Ultra Farm environment relies heavily on NFS mounting of the software installation filesystem and home directories. NQE worked well in this environment, and I would expect it to work just as well in a standalone environment. Usernames and uids were the same on all ufarm nodes, so there were no issues with uid mapping. However, NQE does have a mechanism similar to .rhosts to facilitate uid mapping between hosts.

    As a result of the evaluation, I was impressed with the ease of installation, the network load balancer features and flexibility, the World Wide Web interface, and the heterogeneous platform support. I found NQE to be a viable solution for sites interested in providing commercial-quality, enterprise-wide batch capabilities on Unix systems.

    Bibliography

    1. Kaplan, J. A., and Nelson, M. K. 1994. A Comparison of Queueing, Cluster, and Distributed Computing Systems. NASA TM 109025, June 1994.

    2. Cray Research, Inc. 1996. NQE Administration. SG-2150, September 1996.

    3. Cray Research, Inc. 1996. NQE Release Overview and Installation Bulletin. R0-5237 3.1, September 1996.

    4. http://www.cmpharm.ucsf.edu/~srp/batch/systems.html

    5. http://www.crpht.lu/~petrec_a/html/hpc/software/dbp.news.html

    Acknowledgements

    This work was funded in part by the National Science Foundation Cooperative Agreement ASC-8902825. All brand and product names are trademarks or registered trademarks of their respective holders.

    About the Author

    Victor Hazlewood is Manager of the Production Systems Group at the San Diego Supercomputer Center and has been having fun with PCs, workstations, and supercomputers running UNIX for more than 10 years.


     


