Cover V09, I07

Gettemp -- Built-In Temperature Monitoring for Sun Enterprise Servers

Alek Komarnitsky

Do you know what the temperature of your server room is right now? Do you have some way of being alerted when it gets too warm? If you have a room temperature probe, how do you know when a fan fails in your Sun Enterprise Server and the internal temperature starts to climb? These questions can be easily addressed with gettemp, a free software solution.

Sun Enterprise servers have built-in sensors (depending on the model) that can report on the ambient room or various internal temperatures. You can access this via the following command:

/usr/platform/`uname -i`/sbin/prtdiag -v
This command, however, returns quite a bit of hardware-specific information and has different formats, so it can be difficult to use as a quick and dirty temperature check.
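
As a minimal sketch of how a script might locate and run that command, the platform name reported by `uname -i` is substituted into the path. This is Python for illustration only; gettemp itself is written in Perl and this is not its code. The helper names `prtdiag_path` and `run_prtdiag` are hypothetical.

```python
import subprocess


def prtdiag_path(platform: str) -> str:
    """Build the prtdiag path for a given `uname -i` platform string."""
    return f"/usr/platform/{platform}/sbin/prtdiag"


def run_prtdiag() -> str:
    """Run `prtdiag -v` on the local host and return its raw text output.

    Hypothetical helper: only meaningful on a machine that ships prtdiag.
    """
    platform = subprocess.run(
        ["uname", "-i"], capture_output=True, text=True, check=True
    ).stdout.strip()
    # prtdiag may exit non-zero (for example when it detects a failure),
    # so the exit status is deliberately not treated as fatal here.
    result = subprocess.run(
        [prtdiag_path(platform), "-v"], capture_output=True, text=True
    )
    return result.stdout
```

Keeping the path construction separate from the call makes it easy to check the path logic on a machine that has no prtdiag at all.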

gettemp comprises a couple of Perl/CGI scripts that format the prtdiag output (handling the variations between server models and operating systems) and generate a Web page showing the current temperatures. Optional alerting is available via a Perl stub routine, and you can also generate a viewable time history (see Figure 1).

Figure 1 shows temperature data gathered every 10 minutes (the polling interval is user-definable) from various hosts in an easy-to-understand Web page. The page has a number of clickable links:

  • “only AMBIENT temperatures” -- shows data only from those machines that report ambient temperature. Some machines provide only internal temperatures (which, incidentally, do go up when the room gets warm).
  • Convert to Celsius (or back to Fahrenheit) -- The output from prtdiag is actually in Celsius.
  • You can use the optional alert/notify “stub” routine so that the hostname is a clickable link to some notification URL associated with that hostname. Note that this is for viewing purposes -- the “stub” routine can also generate an event asynchronously. (We generate a syslog event that ends up causing an email.)
  • You can use the optional time history feature so that the “{cool,WARM,HOT!}” is a clickable link to the temperature history of that hostname.
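
The Celsius/Fahrenheit toggle and the “{cool,WARM,HOT!}” labels boil down to a unit conversion and a pair of thresholds. A minimal sketch follows, in Python for illustration only (gettemp itself is Perl, and this is not its code). The threshold values and function names are assumptions, not gettemp's actual settings.

```python
# Hypothetical thresholds in degrees Celsius; pick values that suit your site.
WARM_AT_C = 30.0
HOT_AT_C = 40.0


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading (the unit prtdiag reports) to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def status_label(celsius: float,
                 warm_at: float = WARM_AT_C,
                 hot_at: float = HOT_AT_C) -> str:
    """Map a Celsius reading to one of the three labels shown on the page."""
    if celsius >= hot_at:
        return "HOT!"
    if celsius >= warm_at:
        return "WARM"
    return "cool"
```

Comparing in Celsius and converting only for display keeps the thresholds independent of whichever unit the viewer has selected.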

The rest of the page shows the actual temperatures as being reported by the various servers and is fairly self-explanatory.

Installing and Configuring gettemp

Installing and configuring gettemp is quite simple. It is composed of the following programs:

  • gettemp -- Perl script that parses prtdiag output for temperature data. We put this on all machines under /usr/local/share/bin. You can run this from the command line to see how it works.
  • gettemp.all -- Perl script that uses rsh to run gettemp on each host, and then generates HTML and other data. Decide which host will run it and under which user (we use a “captive” account); that user needs rsh access to the hosts of interest. Then simply set up a cron entry to run it periodically.
  • gettemp.all.hosts.ph -- Perl include file listing the hosts for gettemp.all to check. The comments should be self-explanatory. Enter the human-readable names of the hosts and their hostnames; gettemp.all uses this data to decide which machines to rsh to.
  • gettemp.all.ph -- Perl include file for gettemp.all with miscellaneous data. Again, the comments should be self-explanatory. Check out the optional “alert” functionality that is defined here and called out in the alert routine in gettemp.all.
  • gettemp.cgi -- Perl CGI script that displays the output generated by gettemp.all. You'll need to drop this into your cgi-bin directory. Note that this script tries to make sure its inputs are “semi-reasonable” for security purposes.
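
The division of labor between the pieces above can be sketched roughly as follows: a table of display names and hostnames drives a loop that runs gettemp on each host over rsh and collects the raw output. This is an illustrative Python sketch under stated assumptions, not the actual Perl in gettemp.all. The host table, the function names, and the timeout are hypothetical; only the install path comes from the description above.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical host table: display name -> hostname
# (a stand-in for what gettemp.all.hosts.ph provides).
HOSTS: Dict[str, str] = {
    "Build server": "build1",
    "Database server": "db1",
}

# Install location mentioned in the article.
GETTEMP = "/usr/local/share/bin/gettemp"


def rsh_command(hostname: str) -> List[str]:
    """Argument list that runs gettemp on a remote host via rsh."""
    return ["rsh", hostname, GETTEMP]


def run_remote(hostname: str, timeout: float = 30.0) -> str:
    """Run gettemp remotely; return its output, or "" if the call fails."""
    try:
        result = subprocess.run(
            rsh_command(hostname),
            capture_output=True, text=True, timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # rsh missing or host not answering: report no data for this host.
        return ""
    return result.stdout


def poll_all(hosts: Dict[str, str],
             runner: Callable[[str], str] = run_remote) -> Dict[str, str]:
    """Collect raw gettemp output for every host, keyed by display name."""
    return {name: runner(hostname) for name, hostname in hosts.items()}
```

Passing the runner as a parameter means the loop can be exercised without any network access, for example `poll_all(HOSTS, runner=lambda h: "test")`. The timeout keeps one unreachable host from stalling the whole polling run.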

While you may need some sort of sudo/root access to install these files, gettemp does not need any special privileges (besides the ability of the captive account to rsh to the target machines) to run. While you will need to do some minor code tweaks for your site, these are documented and should not be difficult. gettemp takes about a second or two to run, so it does not cause any significant load on your systems.

How does gettemp work in real life? It works quite well and has been in operational use for more than 6 months at a large (1,000+ UNIX nodes) site with more than 50 Sun Enterprise Servers scattered in various buildings administered by a variety of sys admins.

Previously, when there was an air conditioning problem, it would not be known unless someone walked into the computer room and discovered that it was like a sauna, or the users complained because a machine shut down. Internal fan failures, while rare, were not typically caught until the machine actually shut itself down.

With gettemp, you can take a quick look at all Sun Enterprise Servers by simply clicking on the Web page. When a server does get warm (either due to room or internal temperature), an alert is automatically routed to the appropriate sys admin so that they are aware of the issue and can call facilities, put fans in the server room, or shut the server down gracefully, if needed.
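
The alert routing described here could look something like the sketch below: a stub that decides whether a reading warrants a notification, looks up the responsible sys admin, and hands a message to whatever notifier the site uses (syslog by default, mirroring the syslog-to-email approach mentioned earlier). This is Python for illustration only, not gettemp's Perl stub. The routing table, threshold, message format, and function names are all assumptions.

```python
from typing import Callable, Dict

# Hypothetical routing table: hostname -> responsible sys admin.
ADMIN_FOR_HOST: Dict[str, str] = {"db1": "alice", "build1": "bob"}
DEFAULT_ADMIN = "oncall"   # hypothetical fallback contact
ALERT_AT_C = 35.0          # hypothetical alert threshold, degrees Celsius


def log_alert(message: str) -> None:
    """Default notifier: write a warning to syslog (Unix only)."""
    import syslog
    syslog.syslog(syslog.LOG_WARNING, message)


def alert(hostname: str, celsius: float,
          notify: Callable[[str], None] = log_alert,
          threshold: float = ALERT_AT_C) -> bool:
    """Notify the host's admin if the reading is at or above the threshold.

    Returns True if a notification was sent, False otherwise.
    """
    if celsius < threshold:
        return False
    admin = ADMIN_FOR_HOST.get(hostname, DEFAULT_ADMIN)
    notify(f"gettemp: {hostname} at {celsius:.0f} C, notify {admin}")
    return True
```

Because the notifier is just a callable, a site can swap in email, paging, or its existing monitoring system without touching the threshold logic.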

The time history feature has also been quite useful for identifying when problems occurred and removing ambiguity. For instance, a particular server room seemed warm if visited early in the morning. A review of the logs showed that the temperature typically started to climb after midnight, and then went back down starting at 6:00 A.M. A call to facilities yielded the response that there was nothing wrong with the air-conditioning. A follow-up fax of the gettemp time history logs resulted in an admission that some air conditioning was turned off at night that should not have been.

Some sites already have room temperature monitoring systems. gettemp can not only complement these, but can also monitor the internal temperature of the server. Also, since the alerting capability is simply a Perl stub routine, it is easier to ensure that the right type of notification occurs. Finally, the price is right. I'm aware of one site where the facilities group is spending over $50,000 to install computer room temperature monitoring, but it's not clear how or if admins will get notified when there is an issue!

Summary

At my site, gettemp has easily been incorporated into our 24/7 monitoring and alerting system, so there's also an overall warm fuzzy feeling of confidence that gettemp is monitoring things and the right sys admin will be alerted before things completely melt down. gettemp has been downloaded hundreds of times by the Internet community, and the email feedback has been very positive. That feedback has also helped polish the code, so feel free to send suggestions for further improvement.

Future gettemp Work

Future development is a somewhat bounded problem, in that there should not be much more to do. Obviously, if the output of prtdiag changes with new models and operating systems, some simple parsing code will need to be added. The time history is simply a text dump of the daily data -- you could hook MRTG, Cricket, or any other graphing tool into this, but I was trying to keep things simple. Software always has a few bugs lurking, but I think most of these have been fixed. gettemp could easily be extended to other platforms and operating systems if an equivalent of the prtdiag command is available to query for temperature data.

A tarball of gettemp with code, documentation, and examples can be found on the Sys Admin Web site or at: http://www.komar.org/komar/alek/ -> Misc. Tech Stuff -> gettemp. The author can be reached at alek@komar.org and welcomes any suggested enhancements, bug reports, fixes, or comments in general.

About the Author

Alek Komarnitsky has spent the past 5+ years as Chief Technologist for a large IS consulting/outsourcing firm and helps manage a network of over 1,000 UNIX workstations (and other assorted stuff) scattered from coast-to-coast supporting (literally) rocket scientists. Previously, he was the Network/Systems Manager for two Boulder County software start-ups where he built the computing infrastructures from scratch. His educational background includes an Aero/Astro Engineering undergraduate degree from the University of Washington and an MBA from CU-Boulder. He can be reached at: alek@komar.org.