Useful Scripts for Overworked Administrators

Mark Prager

I work for a startup company, which means we face the usual problems of financing. Because many automated system tools are very expensive, I have written several scripts to automate some of the daily tasks involved in monitoring our systems. I write these scripts in ksh and csh, with a few small C programs (compiled with gcc) where necessary, because these seem to be the least complicated options. The scripts also usually come out self-documenting, which means I can leave them running, return to them several months later, and still understand what I was trying to do. The scripts run under Solaris 2.5.1 through 7 and can easily be adapted to other UNIX operating systems.

Most of the scripts are run from cron so that I get periodic checks, although nothing stops them from being run ad hoc. I also assume that the servers are allowed to log in to each other without being prompted for passwords (achieved by using the .rhosts file in the root home directory, or hosts.equiv in the /etc directory). This security arrangement is sufficient for our network; however, .rhosts and hosts.equiv are not considered secure enough for many organizations, so you may need to adapt these scripts to the security structure of your own environment. These scripts could probably be made more efficient; however, I wrote them during a severe time crunch, and because I am of the opinion that you don't fix something that is not broken, I left them as originally written once they were working.

The first script (Listing 1) is a fairly simple script that I wrote to monitor the disk space on our servers. In the script, the variable "limit" represents the percentage limit after which I want to receive an alert that the disk is getting full. The variable comp_list is the list of servers that I want to check. The next two lines are the initialization of the output file that will be emailed at the end of the operation. The script then runs on each server in the list, gets the percentage use (from the output of df) of all filesystems, and gets the filesystem (mnt) for those percentages. The script then checks each percentage to see whether it is greater than my quota limit. If so, it writes to an output file detailing which filesystem is overloaded, and with what percentage. At the end of the operation, if an output file has been generated, it is mailed to me.
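
Listing 1 is not reproduced in this text, so the following is only a minimal sketch of the check described above, not the listing itself. It assumes Solaris-style "df -k" output, where the fifth column is the capacity percentage and the sixth is the mount point. The function name report_full is hypothetical.

```shell
#!/bin/sh
# report_full (hypothetical helper): read "df -k" output on stdin and print
# one line for every filesystem whose usage exceeds the given limit.
# Assumed usage:  rsh $comp df -k | report_full 90 $comp
report_full() {
    limit=$1
    host=$2
    while read -r fs kbytes used avail capacity mnt
    do
        # Strip the trailing % sign from the capacity column.
        pct=${capacity%\%}
        # Skip the header line and anything else that is not a number.
        case $pct in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$pct" -gt "$limit" ]; then
            echo " ${pct}% on $host : $mnt"
        fi
    done
}
```

Appending the output of each server to one file, and mailing that file only if it is non-empty, would give the behavior described above.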

Example of the output from Listing 1:

From: Super-User [root@sword.fish]
Sent: Sunday, February 11, 2001 5:00 PM
To: mark.prager@seabridgenetworks.com

 91% on barracuda : /raid308
 92% on seal : /export/raid1

I run this script at hourly intervals; however, it could be run more frequently, and the alerting mechanism (email) could be changed to an SMS messaging program or an X Window pop-up. Similarly, the script can be slightly modified to record the usage of all filesystems on all the servers at periodic intervals, writing the output to a file that could later be processed to produce a history of the disk space usage on all servers.
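
The history-keeping variant might look something like the sketch below. It is an illustration under assumptions, not the modification actually used at our site: the function name log_usage, the log line layout, and the timestamp format are all hypothetical, and the input is again assumed to be Solaris-style "df -k" output.

```shell
#!/bin/sh
# log_usage (hypothetical helper): read "df -k" output on stdin and append
# one line per filesystem to a log file, in the form
#   date time host mount-point percent
# Assumed usage:  rsh $comp df -k | log_usage $comp /var/tmp/disk_history
log_usage() {
    host=$1
    logfile=$2
    stamp=$(date '+%Y-%m-%d %H:%M')
    while read -r fs kbytes used avail capacity mnt
    do
        pct=${capacity%\%}
        # Skip the header line and any non-numeric capacity field.
        case $pct in
            ''|*[!0-9]*) continue ;;
        esac
        echo "$stamp $host $mnt $pct" >> "$logfile"
    done
}
```

Because every line carries the host and mount point, the log can later be filtered with grep or sorted to follow a single filesystem over time.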

At our company, instead of having a UNIX desktop for every user, we have a number of central servers. Using a PC tool like Exceed, every user can log into every server. The trouble with this scenario is that every user likes to think that the server belongs to him alone! To discourage such thinking, I wrote a script (Listing 2) that warns users if they have too many processes running on the server. Although this script does not actually kill any processes, the warnings can be annoying and are good enough to keep the users aware of what they are doing.

The main part of the script starts on line 14. I first get a list of all the processes on the server and filter out those users that I don't want to be notified about (e.g., root and daemon). The UID part filters out the banner line of the ps command. The list is then sorted according to username. Line 16 is used to initialize various shell variables; the variable last is the username that will be checked. I initially gave it the value qwert, because I know there are no users with that name on our system.

The mailusermap file looks like this:

...
markp+mark.prager@seabridgenetworks.com+35
tvguser+tvguser@seabridgenetworks.com+70
ccadm+mark.prager@seabridgenetworks.com+100
...

It is basically a database of all the users on the system, their email addresses, and the number of processes each is allowed to have. As shown by the example above, markp can have up to 35 processes, while tvguser (which might be a common or group account) is allowed up to 70 processes.
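
As an illustration of how such a file can be read, here is a small sketch that splits each record on the + separator. The function name lookup_user is hypothetical, and the path of the map file is passed as an argument because its real location is not given here; Listing 2 may well do the lookup differently.

```shell
#!/bin/sh
# lookup_user (hypothetical helper): print "email quota" for the named user
# from a file of user+email+quota records. Prints nothing and returns 1
# if the user is not listed.
lookup_user() {
    user=$1
    map=$2
    # IFS=+ makes read split each record at the + signs.
    while IFS=+ read -r name email quota
    do
        if [ "$name" = "$user" ]; then
            echo "$email $quota"
            return 0
        fi
    done < "$map"
    return 1
}
```

With the sample records above, looking up tvguser would print tvguser@seabridgenetworks.com followed by 70.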

The first time around, the loop does nothing because there is no user called qwert. The next time around, we get the process limit of that user (userquota), and the loop then counts how many processes that person has. If the variable last is not the same as variable i, then we have finished counting all the processes for that user (remember the list was sorted on line 15).

Lines 23-29 check whether the user has overstepped her limit. If so, the function mail_to_user is called (lines 2-13). Lines 34-41 repeat the contents of the loop for the last user on the sorted ps list, because no change of username follows that user to trigger the check.
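
The counting logic can be sketched as follows. This is a simplified illustration rather than Listing 2 itself: the function name count_over_quota is hypothetical, a single limit stands in for the per-user quota from mailusermap, and the function prints the offender instead of calling mail_to_user.

```shell
#!/bin/sh
# count_over_quota (hypothetical helper): read a sorted list of usernames,
# one per process, on stdin and print "user count" for every user who has
# more processes than the limit.
# Assumed usage:
#   ps -ef | awk '{print $1}' | egrep -v '^(root|daemon|UID)$' | sort |
#       count_over_quota 35
count_over_quota() {
    limit=$1
    last=qwert      # placeholder name that matches no real user
    count=0
    while read -r i
    do
        if [ "$i" != "$last" ]; then
            # The name changed, so the previous user's count is complete.
            if [ "$count" -gt "$limit" ]; then
                echo "$last $count"
            fi
            last=$i
            count=0
        fi
        count=$((count + 1))
    done
    # No name change follows the final user, so the same check
    # has to be repeated once after the loop ends.
    if [ "$count" -gt "$limit" ]; then
        echo "$last $count"
    fi
}
```

The check after the loop corresponds to the repeated loop contents mentioned above; without it, the last user on the sorted list would never be reported.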

In the mail_to_user function, lines 5 and 6 determine the user to be informed of the quota overload, and line 7 calls a simple script that prints out the beginning of the email to be sent. The executable on line 8, pstree, is a freebie I downloaded from the Internet; it prints out the process tree for a given user. Line 9 finishes off the email, and line 11 emails it to the user.

I run the following script hourly from cron; it calls the process-counting script (count_proc_ksh) on each server:

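#!/bin/csh
# Collect the over-quota reports from every server into one temporary file.
set comp_list = 'stingray medusa sword seal shark salmon tuna octopus dolphin'
# Make sure no stale file is left over from an earlier run.
touch /tmp/comp$$
rm /tmp/comp$$
foreach comp ( $comp_list )
        # Backquotes run the remote script and capture its output.
        set res=`rsh $comp /usr/local/scripts/count_proc_ksh`
        echo $res >> /tmp/comp$$
end
# Each record ends with @; turn those into newlines for the report.
cat /tmp/comp$$ | tr '@' '\n'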

Notice that the last line translates @ into a new line character. This is because line 4 of the main script prints out the user who has overstepped his limit and which host, terminated by an @. Hence, at the end of the script, we get a report of all the users that have overstepped their limits sent by email (output of cron) to the administrator. Figure 1 shows an example of the letter sent to a user. Figure 2 shows an example of email sent to me.

One problem with having central servers is that, if I want to find a certain process on all the servers, I must look into each server, do a ps on it, and search for the process. To avoid this, I wrote the following small script, which I can run centrally:

#!/bin/ksh
comp_list="stingray barracuda medusa seal salmon octopus tuna dolphin sword"
for comp in $comp_list
do
rsh $comp "ps -ef | sort | sed 's/^/'$comp' /'"
done

This script runs through the list of all the servers and for each server, runs the ps command, sorts it, and (using sed) adds the name of the server to the output. This script is very useful, especially when looking for a user who is hogging system resources through heavy commands like make and link.

A slight modification of the above script allows me to check the availability status of the important servers at our site. The servers need not be UNIX machines; they can also be NT servers and other black boxes such as routers:

#!/bin/ksh
# ping all servers - when one goes down - let me know.
servers="router1 router2 accelar1 barracuda shark medusa sword dolphin
tuna seal octopus salmon stingray tiger hippo rhino puma fox zebra elephant wolf"
for i in $servers
do
    A=`ping $i 10 | grep "no answer"`
    if [[ $A != "" ]] then
        ## Program to notify me of the problem by Xmessage
        DISPLAY=172.30.30.122:0.0
        export DISPLAY
        echo "$i is DOWN" | /usr/local/bin/xmessage -fn charr24 \
          -bg yellow -fg blue -file - -center &
        # SMS Page me too
        cd /users/system/mark/sms
        # Cellcom
        /users/system/mark/sms/page_mp "PING" "server $i is DOWN"
    fi
done

Each server is pinged with a 10-second timeout. If a "no answer" is received, an xmessage pop-up window is sent to my display saying that the specific server is not answering, and I am also paged. Because the page_mp script is adapted especially for the mobile service in my country, I won't go into those details here. However, the script can easily be modified to send email, or pop-up windows via Samba, to inform me of a problem. Note that this script only reports a ping or communication problem; a server may fail to answer pings (for example, during a DoS attack) yet still be running other services, such as databases.

Conclusion

It can be easy to write many useful systems administration scripts that will save you time and money on a day-to-day basis. Many of the expensive commercial tools cover the same aspects and provide similar results. All of these scripts were written with standard UNIX commands and are therefore easy to adapt. There are many other free tools on the Internet that can be downloaded and adapted too, such as the performance analysis scripts written by Adrian Cockcroft using his own scripting language (http://www.sun.com/951001/columns/adrian/column2.html). In some cases, a shell script might not be enough, and you might need to migrate to some other scripting language or, in the worst case, write a simple program in C or another language to handle the problem.

Mark Prager is the Senior UNIX Manager at Seabridge and has a 15-year history with the software industry. He is skilled in many aspects of the software industry, including software engineering, computer security, and network planning. He is also a frequent contributor to the CCIUG newsgroup and is experienced in the management of Rational's ClearCase and MultiSite.