Cover V04, I02
Article
Figure 1
Figure 2
Figure 3
Listing 1

mar95.tar


Finding Disk Hogs

Larry Reznick

"Who's hogging my disk space?" Ignoring the conceit of ownership that we all share, that's what a group manager wanted to know. He'd written a script to tell him who was using disk space on one directory tree important to his operations. That was a good idea, but I wanted something more thorough. I wanted to know about the space usage throughout every disk drive. Furthermore, I wanted one script to find this information on both the HP-UX and the SunOS systems at the site.

dusage (Listing 1) collects disk usage information across every partition on a set of systems. The script will check a set of systems specified on the command line or will check a default set of systems built into the script. You may give a megabyte threshold to reduce or increase dusage's output. Only users whose space usage exceeds the threshold are shown in the report. You may specify the threshold with or without specifying which systems to check, but to specify systems you must give a usage threshold.

Disk Usage Report

dusage's report shows the amount of space used by each mounted filesystem and the amount of space used by each user. Figure 1 shows a full report of the script's default systems, using the 50 megabyte default threshold. Figure 2 shows a report for only one system using a 20 megabyte threshold. The smaller the threshold, the more users who show on the report. dusage accepts a threshold as small as 1 megabyte, which shows nearly every user storing files on the disk.

dusage's report is designed not only for easy reading by administrators and managers, but also for easy parsing when feeding its output into other scripts for further analysis.

As a heading, the first line identifies the megabyte threshold used for the report. Two lines are skipped between each system's report and a single line is skipped between each disk drive's report. Each disk's report data is single-spaced.

System subheadings are preceded by four equals signs (====) to simplify extracting the report for one system from a larger report containing many systems. Assume that you've saved a large dusage report to a file by redirecting the script's output. You can extract a single-system's report by running

egrep -n ==== reportfile

This command shows a list of every system named in the file and the line number where each system's subreport begins. Say that the report you want begins on line 53 and the next subreport begins on line 69. Use the command

tail +53 reportfile | head -16

to extract the 16-line subreport. If the subreport is the last one in the report file, you don't need to pipe the tail program's output to the head program.

Each system subreport shows all the disk partitions for that system. dusage shows the names of the partitions' mount points using remote system notation:

systemname:/mountpath

Beneath the partition's designation is a line containing space usage figures for the whole partition. These figures come from the df program, but they're scaled to use megabytes instead of df's usual kilobytes output. Total space usage lines are surrounded by parentheses to make extraction of these lines simple for scripts that want to analyze dusage's output.

The usage percentage is calculated the same way df calculates it, but shows the value rounded to two decimal places. df rounds to the nearest integer. The total shown ("Tot:") is typically greater than the sum of the "Used:" and "Avail:" space because "Tot:" reflects certain space used as overhead by the system. The "Usage:" percentage reflects the actual space available from the "Used:" and "Avail:" sum, not the potential space as shown in the "Tot:" field.

User usage lines follow the space usage line. These lines show each user whose usage exceeds the threshold. User lines show in order of most space used to least space used. Three tab-separated columns show the total megabytes used, the name of the user holding that space, and the percentage of total space used by the user.

UNIX allocates disk space 1K (two 512-byte blocks) at a time. Total megabytes used represents the total allocated space in 1K allocation units held by a user, not merely the sum of a user's files' bytes. Usage percentages come from the relationship of the user's files' usage to the partition's available space, which is the sum of the partition's "Used:" and "Avail:" figures.

Disk Usage Script

The heart of dusage is the quot program. (Refer to quot(8) on SunOS, and quot(1M) on HP-UX and on other versions of System V). The -a option causes quot to examine every mounted filesystem on a system and report who is using space and how much. quot is perfect for dusage's needs, because quot's report gives the total 1K allocation units for each user in order of most used to least used.

Unfortunately, quot only gives this information when run by a root user. Therefore, dusage requires the user to have an effective UID of zero. If you need dusage's report frequently, run dusage in root's crontab. Because dusage sends its output to the standard output, cron automatically mails the output to the crontab owner unless you redirect that output to a file. If you want many people to read that file, be sure to set appropriate owner and group IDs and permissions as part of the cron job.

Listing 1 shows dusage, a Bourne shell script. dusage has been tested to work correctly on HP-UX and SunOS systems. It may work unchanged on other versions of UNIX, but I've only tested it on those two systems.

SYSTEMS is set at the beginning of the script to make this variable easy to find and modify. The SYSTEMS variable holds the default systems that dusage reports about when system lists are not included in its command line. Listing 1's settings are dummies. Modify your SYSTEMS variable to hold the system names you're concerned with most frequently.

Because root privileges are required to run this script, the PATH variable is restricted to /bin, /usr/bin, and /etc. These directories hold most of the programs dusage needs. Other programs have their paths fully spelled out when needed.

PROGNAME is set to the base name used to invoke the script so that error messages refer to the correct invocation name. Later script actions cause dusage to lose access to $0, so I reserve its value now.

The next phase of the script introduces a method for making this script as portable across UNIX systems as possible. Because of my initial development requirements, this script runs on HP-UX and SunOS. HP-UX is System V-based and SunOS is BSD-based. There are enough differences between them to drive a portable script developer nuts. The script has to deal not only with program differences on the executing system, but also with differences on the remote systems from which usage figures are collected. The executing system may be different from the remote, examined system. Designing the script so that it could collect HP-UX figures even though run on SunOS, or vice versa, was interesting. Throughout discussion of the rest of this script, keep in mind that the executing system may run a different version of UNIX than the target system.

WHICHSYS initially holds the operating system name on the executing system, although later it will hold the name of the target system. The next test figures out which UNIX system is executing by looking at the first five characters. dusage assumes that if those five characters aren't "HP-UX," they're "SunOS." Only an if...else condition is needed for this test. If you find that simple assumption won't work for your needs, change this to a case statement. Make the case execute expr to extract the five characters, just as dusage does. Look for HP-UX and SunOS using the usual case pattern matching, and then add patterns to match whatever else your UNIX systems require.

At this point, the script sets the remote shell command (RSHCMD), the awk command (AWKCMD), and the echo command (ECHOCMD) variables. HP-UX uses remsh, but SunOS uses rsh. I suspect HP-UX names the command differently to avoid conflict with the restricted shell's name. Nevertheless, System V uses rsh(1), so I view HP-UX as being different here. However, remsh is in dusage's PATH. SunOS presents a problem by keeping its version of rsh in /usr/ucb, which isn't in dusage's PATH.

dusage uses awk for its data collection, calculating, and formatting. Some UNIXs keep the obsolete version of awk hanging around with the new, more thorough version of awk. SunOS does this. Most System V versions of UNIX I've worked with do this also, naming the older version awk and the newer version nawk. On systems I administer, I prefer to rename the old version of awk to oawk and then make a hard link from awk to nawk. This keeps both versions available but forces the more powerful, new version as the default awk without taking much extra system space. However, because I wrote this script to work despite someone having set that link, AWKCMD points to awk for HP-UX and nawk for SunOS, those systems' respective versions of the more complete awk. Use additional case pattern matches to direct your AWKCMD another way.

ECHOCMD points to the System V version of echo. HP-UX automatically uses that version, but SunOS must go to /usr/5bin to get that version. System V's echo supports many embedded control codes, using a C-like escape notation with a backslash. BSD's echo has no such support. dusage needs embedded newlines to make its messages easier to read.

With these three primary support programs selected, dusage can move on to examine its command line. If ECHOCMD isn't set correctly, not even the usage message will output portably because it uses an embedded newline (\n).

THRESHOLD comes from the first argument on the command line. However, the user doesn't have to give that argument. Without it, THRESHOLD is set to 50. If you prefer some other default threshold, set the number of megabytes after the colon-hyphen (:-) notation. That threshold must be a positive number. Any negative number or zero is assumed to be an error and shows the usage message. This way, if the user starts dusage with a negative number threshold, such as "-50," or with a non-numeric argument, such as "-h" or "help," dusage tells the user how to use it. Figure 3 shows an example of this help message.

For the rest of the script, the user must be a superuser. Using "id -u" to see if the effective UID is 0 prevents the script from going further. Figure 1 shows an example of this message. Usage messages and error messages are redirected to the standard error device to differentiate them from the report's normal output. However, if a cron job redirects dusage's standard output to a file, cron still mails the standard error to the root user.

The next test checks whether systems were named on the command line. Because systems cannot be named without identifying a threshold, the test expects at least two arguments on the command line for a system name to be present. If no such names are on the command line, the default SYSTEMS value is set into the positional parameters. When the command line contains system names, the threshold is shifted out of the positional parameters. Whichever way this test goes, when it is finished, the positional parameters contain only system names.

Keying to the Remote System

Report generation begins by showing the current run's threshold. For each system named in the positional parameters, dusage must adjust itself to use certain programs on the remote system. WHICHSYS is reused to hold the remote system's OS name. Appropriate df and type commands are selected, according to whether the remote system is HP-UX or SunOS. On SunOS, df outputs a report showing disk free space in 1K units. HP-UX's df shows the same information in 512-byte blocks, but doesn't support System V's -k option to show it in 1K units. Instead, HP-UX has a different program, bdf, to show the same format as SunOS's df and System V's "df -k." Use a case statement to handle other operating system differences.

Discovering the proper path to certain commands became more complicated than the simple if...else used so far. On HP-UX and many System V installations, root users start up a remote shell using either Bourne shell (sh) or Korn shell (ksh). On SunOS, root users start up C-shell (csh). Even though dusage runs under Bourne shell because of the #!/bin/sh line at the top of the script, when a remote shell (remsh or rsh) starts, the default shell on the remote system is not necessarily the same shell. That means the command to find the full pathname of a program differs according to the target system: /usr/ucb/which on SunOS's C-shell, and type on Bourne shell or Korn shell.

This different shell problem gets hairier on HP-UX. According to the remsh documentation, remsh's initial PATH excludes the /etc directory, which some dusage commands need. So, the TYPECMD used on HP-UX must contain the explicit PATH setting used by dusage, including /etc.

TYPECMD throws one more gotcha: type's output style is different from which's style. Fortunately, the TYPECMD's output format always shows the pathname last. Pipe the TYPECMD's output to the AWKCMD to extract the last field's value, no matter how many fields there are. This delivers the full pathname. DFCMD is set to such a full pathname for the remote system's command, be it bdf or df. USECMD is similarly set to the usage command, quot. An extra quoted string, " -a" with a space before the hyphen in the quotation marks, follows the output of the shell substitution so that quot will show results for all mounted filesystems.

Collecting Remote System Statistics

Once all the target system configuration work is done, dusage can begin collecting disk usage figures for the remote system.

Skipping two lines, dusage announces the system name for the subreport. It then remotely executes the appropriate USECMD and pipes USECMD's output into a local copy of AWKCMD. AWKCMD collects and formats the figures.

Two kinds of output lines come from USECMD (quot): an initial line identifying the current disk partition, and a set of secondary lines identifying the users taking space on that partition. Relatively little work goes into handling the secondary lines. Much more work goes into developing the partition headings for the report from the initial lines.

Initial lines from quot identifying the current disk partition always refer to the physical device pathname. Because the dev directory is always part of that pathname, despite any subdirectory classifications, seeing if the line contains dev seemed most portable. So, the AWKCMD pattern

$0 ~ /dev/

looks for those three letters -- slashes surround regular expressions in awk patterns -- and executes the action associated with a partition's initial report line.

quot delivers its partition identifier showing the device pathname and then, in parentheses, the mount point pathname. If you prefer to see both, use $0 in printf()'s second parameter. However, assuming that most managers analyzing the report may not care about the partition's device name, I extracted the mount point pathname from $0. To extract the mount point, set AWKCMD's match() function to find all the characters within parentheses. Those parentheses needed to be double-escaped -- using "\\(" not just "\(" -- because HP-UX's awk and SunOS's nawk use differing regular expression rules. Doubling the escape character eliminated the difference.

AWKCMD's match() function loads the global variables RSTART and RLENGTH. These hold the relative starting position and the relative length of the matched string. Used with the AWKCMD's substr() function, those global variables deliver the mount point pathname contained in quot's initial line. Because the regular expression includes the parentheses in the match, I add one to RSTART to skip the open parenthesis and subtract two from RLENGTH to ignore both the open and close parentheses. That extracted name is formatted into dusage's output showing remote system notation. The report skips a line before and after this output.

Finding the remote system's total space, as reported by DFCMD, requires separate remote shell execution. Recall that the remote shell invocation of USECMD has already delivered its output to this AWKCMD. Another RSHCMD won't interfere with that output. Retrieving this new RSHCMD's output from the DFCMD from inside the AWKCMD is a little tricky. It not only requires the RSHCMD and the DFCMD, but also requires DFCMD's output to pipe through egrep. egrep extracts the only relevant line from DFCMD's output. To make AWKCMD execute RSHCMD and retrieve the output, I pipe the command into getline.

AWKCMD's sprintf() function creates a string from the several variables that form the remote shell DFCMD with the egrep pipe. This string is formatted and assigned to the sysspace variable. Expanding the string in front of the pipe to getline runs the command and delivers the output to AWKCMD. Because of egrep the command only outputs one line. getline assigns each word from the output line to AWKCMD's positional parameters. Doing this replaces USECMD's input line, but the program is finished with that. Total, used, and available space figures come from USECMD's output. dusage converts them from DFCMD's kilobytes format into megabytes, calculates a maximum space value for the usage percent figures, rounds these figures to two decimal places, and prints them as floating point numbers with "M" characters to designate them as megabytes.

AWKCMD holds open any file associated with a pipe to getline. After printing the partition's space usage figures, dusage explicitly closes that file; otherewise, AWKCMD will run out of file handles. AWKCMD may only have one pipe opened at a time. Because a similar pipe gets opened for each partition on every system, dusage closes each pipe when finished with it so the next one will open.

Finally, next skips all remaining patterns and actions for this initial input line. Without executing next, AWKCMD would try to apply the remaining patterns and actions, and could produce very strange output.

All secondary lines coming from USECMD are handled by AWKCMD's second pattern. In the first column of input, quot delivers the number of 1K blocks used by every user on the target partition. Is that number, divided by 1024 to translate kilobytes into megabytes, greater than the threshold? If so, this line identifies a user holding more space than the threshold. The script then formats and prints the usage amount in megabytes, along with the user's name and the usage percent.

Conclusion

dusage may point out surprising disk space usage figures. It helps administrators and managers identify who needs a nudge to clean up files. It can also help them identify which users typically use the most space when they must reorganize file space onto new servers.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX, MS-DOS, and OS/2. He teaches C, C++, and UNIX language courses at American River College and at the University of California, Davis extension.