dec2002.tar

Remote Machine Monitoring with Marvin

Alistair Young

I was asked to produce a monitoring application for the University of the Highlands and Islands (UHI) Project, an organization consisting of 13 academic partners, 2 associate institutions, and a directorate. Our network connects the partner campuses and 70 learning and outreach centers throughout the highlands and islands of Scotland. We had found that, occasionally, machines would drop off the network for reasons such as unannounced engineering work or just plain failure.

Because the WWW unit is responsible for hosting the partner Web sites and providing access to the Internet for private block IP machines, it was imperative to know when the Web server and proxy machines were unavailable and also when the Webmail server was doing something it shouldn’t. Many staff members work from home over the weekend, and Web access to their email is vital, so any solution had to take this into account. Student record systems running on Oracle databases were also pretty important, so I also needed a way to ensure these machines were doing what they were supposed to. In all cases, a technical person had to find out about the fault before the users.

On the Oracle front, Perl was out as I wanted a robust system, both simple to implement and use. Perl required the entire Oracle client on the monitoring server. Java didn’t, as Oracle supplied a type 4 ODBC driver. So, with a special user added to the Oracle systems, I wrote the database monitoring in pure Java, simulating in code what a user would do to extract information from the database. If the process failed, then the machine wasn’t functioning and an alert could be produced.

With the other machines, the solution was on two fronts. Is it serving? Is it responding? Therefore, I needed to check HTTP and ping. I identified three types of machines:

1. Web server
2. Proxy
3. Router

Armed with this information, and the fact I wanted to use as much open source as possible, I dived into the murky world of object-oriented (OO) Perl, and Marvin was born. I named this application after the intelligent robot in The Hitchhiker’s Guide to the Galaxy by Douglas Adams who spent all day doing menial tasks such as talking to doors.

Anatomy of an OO Perl Module

In Perl, a class is just a package, so declaring a simple Server class is easy:

package Server
{
  my $fields = {_name => "www.sysadmin.com"}; # Reference to an anon hash
  sub new { bless ($fields) };
}
1;

The last line reminds you that it is just a package after all. Class members and methods sit inside the curly braces. This is OO Perl at its simplest — no encapsulation/inheritance or interesting stuff like that. Calling new, you get a reference to an anonymous hash containing the class members. You have to do the work of the compiler. While reading about all this, two methods dropped out — the “Flyweight pattern” and “Secure hashes”. Now, secure hashes give you just about every OO feature you could want but at a price — speed and efficiency go down the drain and anyway, it’s Perl, you shouldn’t be paranoid. So, this led me to use the flyweight method, which basically hides a class’s methods and members in a randomized hash. You create the class, create a unique number, associate the number with the class, and bless the number. What you then have is a reference to a hash key, whose data points to the class you’re after:

...
{ # Extract from Server.pm
  # Start of "class"
  my %_data;
  my %_fields = {
  ...
  # Public:
  type => undef,   # server/proxy/router
  name => undef,   # Description
  # Private:
  _watch_interval => undef,   # Watch machine every so many secs
  _last_watch     => undef,   # Epoch time of last watch
  ... };

  # Extract from the constructor
  sub new
  {
    my ($class, %args) = @_;

    # Get a reference to the allowed fields hash
    my $dataref = {%_fields};
    ...
    # Create a unique key to identify this class...
    $dataref->{_key} = rand
      until $dataref->{_key} && !exists $_data{$dataref->{_key}};

    # and store it in the class
    $_data{$dataref->{_key}} = $dataref;
    ...
    bless \$dataref->{_key}, $class;
  }

  # "Get" method
  sub get_name {return $_data{${$_[0]}}->{name}}
} # end of "class"
...

The class contains the “_fields” hash, which is a container for all the members, and the enclosing braces mean that only functions declared within can access this hash. Normally, you could now bless a reference to this hash but that wouldn’t be OO, so instead, a unique number is generated, stored in the “_data” hash, and used to point to “_fields”. This unique number is what I bless. So, what you get back when you call “new” on this class is a reference to an index into a hash, which points to the hash storing the actual class data. The “get_name” method looks horrendous but is actually:

1. class_name = $_[0] ==> class name
2. class_ref = ${class_name} ==> Dereference to get the unique number generated by the call to “new”
3. $_data{class_ref}->{name} ==> Get the “name” member from the class “_fields” hash

This is the only way to get this class member’s data. It’s OO, sort of.

The Daemon Principle

Rather than use a cron job, daemonizing the application provided much more flexibility in polling machines. I wanted people to use the Web front-end to add machines to the list and set a polling time independent of other machines. If they required the polling time increased for a short period, they could do it through the front-end without affecting anything else. This just wasn’t possible with cron. So, I learned about daemons:

sub daemonise
{
  chdir '/';                    # (1)
  umask 0;                      # (2)
  open STDIN, '/dev/null';      # (3)
  open STDOUT, '>$stdout_file';
  open STDERR, '>$sterr_file';
  defined (my $pid = fork);     # (4)
  exit if $pid;                 # (5)
  setsid;                       # (6)
}

1. The first thing to do is get out of any weird path, like a mount.
2. A umask of 0 lets us set any of the bits for the “process” we’re about to “create”.
3. Redirect STDOUT and STDERR to log files to keep the terminal uncluttered. The whole point of the daemon is that it doesn’t require a terminal.
4-5. Fork a new process and kill the main one.
6. Run the child in a new session to detach it from the terminal.

With the app running as a daemon, I could control it by modifying its conf file and HUP’ing it.

SMS

As I said previously, I wanted to use as much open source as possible and this included gnokii, the Linux mobile phone driver. All it requires is a Nokia phone plugged into the serial port, a phone number, and a message. By wrapping this in a Perl module, Notify.pm, and populating it with info from the conf file, I had instant access to various technical bods at all hours of the day and night. To wake Joe Bloggs at 3am when a router stopped responding was simplicity itself:

echo "wake up! Your router is down" | gnokii —sendsms 0777745678

The driver has numerous functions, which I plan to exploit and are described in further developments.

The conf File

I needed simplicity:

...
# Notification nicknames
# This identified Joe Bloggs by the nickname "joe". He can be 
# contacted via email and SMS (yes,yes) and his contact details follow.
notify,joe,Joe Bloggs,yes,yes,joe@work.com joe@home.com,0775644334
notify,jim,Jim Bloggs,yes,yes,jim@work.com jim@home.com,0775644334 0777745678

# URLs to test HTTP servers
# This defines a set of URLs named "test_urls"
urls,test_urls,www.bbc.co.uk www.annwiddecombemp.com www.cnn.com www.ed.ac.uk

# This identifies a machine to test. In this case it's a proxy 
# server, port 8080 with a timeout of 100ms, doesn't implement 
# sysbot (see later), has a polling interval of 600s and is tested 
# using the "test_urls" set of test URLs. If something goes wrong, 
# "joe" and "jim" want to know.
proxy,Squid2,squid2.smo.uhi.ac.uk,8080,100,nosysbot,600,test_urls,joe jim
...

The log file

A sample log line for an HTTP server being monitored:

Sun Sep 8 11:23:59 BST 2002,vader.uhi.ac.uk,ok,http://www.uhi.ac.uk,ok,3.80,100,,

To begin, you get the date/time of the poll, then the machine name, the URL used to test it, and the response. After that, get the ping status. “ok” means it responded within the timeout set in the conf file. The actual time it took in ms is next, followed by the timeout in ms. The last two entries are for sysbot-supplied uptime and load. As this server isn’t running sysbot at the moment, these are blank. If the server stops handing out HTTP, we can check whether that server is up from the ping time. If both are not “ok”, then it’s a safe bet the machine is down, unless a router near it is also reporting not “ok”, in which case it might just be unreachable.

The Web Front-end

To get around the problem of tech bods sending me emails to add machines to the list, I developed a Web front-end in PHP, which displayed the current status of all machines Marvin knew about and allowed technical staff to modify the configuration and add new computers to the list. Status information was provided by Marvin via a PHP associative array, which it generated on the fly and which the PHP front-end included:

...
"UHI Web Server" => "server!!!Mon Sep  9 09:41:14 \
  2002!!!ok!!!http://www.uhi.ac.uk!!!ok!!!3.84!!!!!!!!!",
...

The sample line above told PHP that the “UHI Web Server” was last polled on Mon Sep 9, was serving ok, and was within the ping timeout. Figure 1 shows an extract from the status page showing a summary of machines. As long as all lights are green, no further drilling down is required. Figure 2 shows an example of drilling down through a particular machine to get its monitor details. Figure 3 shows configuration details for a Web server.

I’m also working on porting the original JSP code, which produced a summary of problems for people looking at the main Web site. This provided status information along the lines of “Web server slow today” and other such information. With status data collecting in the log files, I started to investigate graphing methods to display it. Originally, I was going to use RRD tool, but as the polling rates would be variable, it wasn’t really an option. Instead, I chose JpGraph PHP graphing classes because I had already used these to build graphs of our proxy servers’ memory and CPU rates.

Further Developments

I’m working on enhancements such as user identification on the front-end to only allow updates to those machines you have added. Sysbot is coming along with authentication being built in and a range of other machine variables being added. At the moment, it monitors uptime and load, allowing me to tell me when a machine has been rebooted and whether the load on it was inordinately high just before it went offline.

One of the major enhancements will be server control via mobile phone. If your server goes down, you can reply to the SMS message with a reboot command. As long as the sysbot port is open and the server is reachable, the reboot signal will get through.

Usage Example

The application has proven its worth by alerting key personnel when a fault has occurred (such as Webmail not serving) and they could fix it before someone important noticed. The SMS function proved to be a double-edged sword though. As an example, shortly after the application went live, I drove the forty-odd miles to Portree to MOT my car. On the way, a blanking plate blew off the exhaust and I roared into town sounding like a jumbo jet. After I dropped the car off at the garage, my phone went “ping” and a message appeared from Marvin — one of the proxies had stopped serving! When I get round to implementing sysbot, I will have the functionality to reboot that machine from Portree high street. If nothing else, it would make for interesting conversation in the baker’s queue:

“White or brown loaf, dear?” “Er, could you excuse me please, I’m just rebooting my proxy!”

References

Object Oriented Perl by Damian Conway. Manning, ISBN 1-884777-79-1.

Programming the Network with Perl by Paul Barry. Wiley Computer Publishing, ISBN 0-471-48670-1.

JpGraph — http://www.aditus.nu/jpgraph

Dependencies

HTML-Tagset-3.03
HTML-Parser-3.26
libwww-perl-5.65
Mail-Sendmail-0.77
URI-1.19
Time-HiRes-01.20
Net-Ping-2.19
Time-modules-101.062101
gnokii

Links

http://www.uhi.ac.uk
http://www.smo.uhi.ac.uk
http://www.suse.com

Alistair Young, graduated with a BSc(Hons) in Physics & Microsystems from the University of Abertay, Dundee. He spent five years at OKI (UK) Ltd. writing Windows printer drivers in C/C++ and InstallShield applications. Alistair joined the UHI Millennium Institute in 2001 as a senior software engineer, producing server monitoring software and Web applications. He is now happily out of kernel mode and playing around in Flash/PHP/Java and Perl. Alistair can be contacted at: sm00ay@groupwise.uhi.ac.uk.