Weeding Your System

Larry Reznick

Log files and temporary files are like the weeds in your yard: if you don't take care of them, they take over. A number of standard UNIX files accumulate during the normal operation of the system. You may also find that specialized applications at your site create files that accumulate over time. If you don't pay attention to those files, they'll take over your filesystem.

The system handles many log files automatically; others, it either doesn't handle very well or doesn't handle at all (the printer request file comes to mind as an example of one that isn't cleaned up automatically but should be). Fortunately, the system regularly creates and updates many of these files in known locations. That regularity makes them good candidates for cron(1) jobs to do the cleaning.

If you check root's crontab(1) (using the "crontab -l" command while you're logged in as root), you may find one or two jobs related to cleaning up the files. For example, SCO SVR3.2v4 has /usr/lib/cron/logchecker and /usr/lib/cleantmp. These two shell scripts check various files and eliminate, prune, or age them.

The simplest solution is to eliminate the file. Usually the application writing to the file will recreate it. One flaw with this solution is that the data isn't around any more when you're giving a problem a postmortem exam.

Another solution is to prune the file, keeping only the most recent records. Newer records are typically added to the end of the existing file. Periodically stripping the older records at the top of the file reduces the file's size while still keeping the records that could aid in a postmortem.

Finally, you could age the file. Copy the existing file to a file with a similar name or to another directory specially set up to hold these aged files. Then recreate the file expected by the application. Once finished, the entire past log is present for your postmortems until the next aging time. Eventually, aging will remove the oldest version.
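The three approaches can be sketched as small shell functions. This is only an illustration: the function names, the O-prefixed copy name, and the 1000-line default are assumptions, and the prune step assumes a tail(1) whose -n option can hold that many lines.

```shell
# Eliminate: empty the file in place. Owner, group, and mode survive
# because the file itself is never removed.
eliminate_log() {
    : > "$1"
}

# Prune: keep only the last $2 lines (default 1000).
prune_log() {
    keep=${2:-1000}
    tmp="$1.prune.$$"
    # Writing back with cat keeps the original file and its settings.
    tail -n "$keep" "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Age: save a copy as O<name> in the same directory, then start over.
age_log() {
    dir=`dirname "$1"`
    base=`basename "$1"`
    cp "$1" "$dir/O$base" && : > "$1"
}
```

Called as "prune_log /some/log 500" or "age_log /some/log", each function leaves the active file in place for the application that writes to it.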

Logchecker

This script checks the size of the cron log file. If you want to keep track of cron's doings, look for a file named /etc/default/cron. You can tell cron to keep a log of everything it does by setting:

CRONLOG=YES

Usually, CRONLOG is set to NO, which doesn't keep the log.

The log file is named /usr/lib/cron/log. If you've been keeping the cron log and haven't been monitoring it, you should decide whether it is worth keeping. If you're going to keep it, the logchecker script provided by SCO ages the file for you.

Cron runs logchecker twice a week. I have it set to run at 1:03 AM on Sunday and Thursday. So, the crontab entry looks like this:

3 1 * * 0,4 /usr/lib/cron/logchecker

When you're root, changes you make to an existing file don't affect the file's ownership or group settings. If you create a new file, though, the file gets root's permissions, ownership, and group. The logchecker script sets the umask to 022, taking away write permission from both the group and other users.

Logchecker allows the cron log file to grow as large as four blocks short of the ulimit. You can set a variable to this limit as follows:

LIMIT=`ulimit`
LIMIT=`expr $LIMIT - 4`

Before going any further, the script tests whether the log file exists. If not, the script simply quits. If the log file does exist, the script uses du(1) to find its size in blocks. If that size is greater than $LIMIT's value, the script copies the log to a file named olog, which is kept in the same directory as the log file. The log file is then recreated by using the command:

>log

This creates an empty file named log and gives it the permissions allowed by the umask, along with root's ownership and group. The script then changes the group to bin.
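The steps just described can be pulled together in a short sketch. This is not SCO's logchecker: check_log and its two arguments are hypothetical, and the limit is passed in rather than read from ulimit, because many current shells report "unlimited" there.

```shell
# check_log LOG LIMIT
# LIMIT is in the block units that du reports on the local system.
check_log() {
    log=$1
    limit=$2

    [ -f "$log" ] || return 0        # no log file: simply quit

    set -- `du -s "$log"`            # $1 now holds the size in blocks
    if [ "$1" -gt "$limit" ]
    then
        dir=`dirname "$log"`
        cp "$log" "$dir/olog"        # keep one aged copy
        > "$log"                     # leave an empty active log
        # The article says logchecker then runs chgrp bin on the log;
        # that step is left out so the sketch runs without root.
    fi
}
```

A call such as "check_log /usr/lib/cron/log 1020" would mirror a ulimit of 1024 less the four-block margin.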

Using a ulimit-based high-water-mark test is not a bad idea. One problem with this test is that some systems have a very large ulimit, which allows this file to get ridiculously large. Four 512-byte blocks leave about 2 KB of space for the cron log to grow before crashing into the ulimit. Presumably, cron will execute logchecker again before the cron log uses that space. But, with a large ulimit, the log simply takes up too much space, and logchecker may not act before the file grows to unmanageable proportions. One of the following solutions could help:

1. Lower the high-water-mark by changing the four to a larger number.

2. Force a constant block-size-maximum limit into the test.

3. Use the ulimit command in the script to temporarily lower the ulimit. This applies only to the run of this script.

You can apply the third technique without changing the script at all. Set the ulimit within the cron entry just before executing the logchecker script:

3 1 * * 0,4 ulimit 1024; /usr/lib/cron/logchecker

Using 1024 cuts the log file off at just under half a megabyte (remember that the high-water-mark test subtracts four from the ulimit). That seems like a workable size: large enough to be ignored for a while, but not so large that you can't wade through it if you have to. With one aged olog file and the active log file online at once, this setting wouldn't use more than one megabyte of space total.
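The second solution, a constant ceiling, does need a change inside the script. A minimal sketch follows, assuming a made-up MAXBLOCKS variable; note that the size of a ulimit block depends on the shell, and that many current shells print "unlimited" rather than a number.

```shell
# Cap the limit at a fixed number of blocks, whatever ulimit reports.
MAXBLOCKS=1024                  # hypothetical ceiling
LIMIT=`ulimit`
if [ "$LIMIT" = unlimited ] || [ "$LIMIT" -gt "$MAXBLOCKS" ]
then
    LIMIT=$MAXBLOCKS
fi
LIMIT=`expr $LIMIT - 4`         # same four-block margin as before
echo "high-water mark: $LIMIT blocks"
```

Unlike the crontab approach, this leaves the process's real ulimit alone and only changes the number the test compares against.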

Cleantmp

I've set cron to run cleantmp every day at 1:06 AM. The crontab entry looks like this:

6 1 * * * /usr/lib/cleantmp > /dev/null

Notice that this time is shortly after logchecker executes. I schedule most of the cleanup work during the 1:00 AM hour, just before the system backup, which runs at 1:45 AM.

Cleantmp doesn't age the temporary files. It eliminates them based on a time limit. Cleantmp uses a file named /etc/default/cleantmp, which contains two variables telling the script how to act: FILEAGING and TMPDIRS. FILEAGING identifies the number of days old a temporary file can be before it qualifies for removal. TMPDIRS tells which temporary directories to apply cleantmp's actions to.

Edit the /etc/default/cleantmp file to set the FILEAGING and TMPDIRS variables the way you want. You may want to clean up only files older than 7 or 14 days. Of course, if your system has lots of temporary files created by users running applications, you may want to relieve the burden on the filesystem more frequently, such as every three or five days.

Change the TMPDIRS variable to add any extra directories you need to the default ones already there. Frequently, system administrators move temporary files out of the /tmp directory into /usr/tmp. This takes the burden off the root filesystem, which is often a small partition on the boot drive. If all of the available workspace on the root partition is used by the /tmp files, the entire system can be brought to its knees. The /usr/tmp directory is often on another partition -- or, even better, on another drive. If you've configured certain applications to put their temporary files in some other directory, add that directory to the /etc/default/cleantmp TMPDIRS list.

Cleantmp reads the /etc/default/cleantmp file using sed. It extracts the FILEAGING variable's value with the following regular expression:

/^FILEAGING=[0-9]*$/s/FILEAGING=//p

This looks for FILEAGING (notice that it must be spelled with all capital letters) at the beginning of the line, followed by an equals sign, then followed by zero or more digits, which make up the rest of the line. If that search is successful, it substitutes the "FILEAGING=" part with nothing, then prints the result. Substituting that way deletes everything but the number following the equals sign. A cleantmp shell variable gets the number.
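To try the expression outside cleantmp, it can be run against a scratch file. The file name and its contents below are made up for the demonstration, and the sketch assumes sed is run with -n so that only the line printed by the p flag comes out.

```shell
# Hypothetical stand-in for /etc/default/cleantmp.
conf=${TMPDIR:-/tmp}/cleantmp.demo.$$
cat > "$conf" <<'EOF'
# sample settings
FILEAGING=7
TMPDIRS=/tmp /usr/tmp
EOF

# Only the FILEAGING line matches; the substitution strips the name.
FILEAGING=`sed -n '/^FILEAGING=[0-9]*$/s/FILEAGING=//p' "$conf"`
echo "$FILEAGING"

rm -f "$conf"
```

With the sample file shown, the variable ends up holding 7. A line such as "FILEAGING=7 days" would not match at all, because the digits must run to the end of the line.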

The TMPDIRS search uses a slightly different regular expression:

/^TMPDIRS=[/]\{0,1\}.*$/s/TMPDIRS=//p

Searching for the all-caps variable name is the same as before, but what follows the name differs because that isn't expected to be a number. The bracketed slash is a readable way to isolate the slash character to prevent sed from interpreting it as the end of the search string. Another way to do this is by escaping the slash ("\/"), but that makes the expression difficult to read. The escaped braces surround a comma-separated range for a repeat count. So, the combination of the slash and the repeat count allows zero or one slash; that is, the slash isn't required, so the directory could be a subdirectory of the current directory. That's not a very good idea, though. It is best to use absolute pathnames when using rm. At any rate, the rest of the regular expression (".*") matches any run of characters up to the end of the line. So, the cleantmp shell variable gets the list of directory names.
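A third way to keep the slash readable is to change the delimiter. The sketch below is an equivalent spelling, not the line from cleantmp: it uses "|" as the delimiter for both the address and the substitution, which sidesteps the question of how a particular sed treats a delimiter inside brackets. The scratch file and its contents are invented for the demonstration.

```shell
# Hypothetical stand-in for /etc/default/cleantmp.
conf=${TMPDIR:-/tmp}/tmpdirs.demo.$$
cat > "$conf" <<'EOF'
FILEAGING=7
TMPDIRS=/tmp /usr/tmp
EOF

# \|...| selects the line; s|...||p strips the name and prints the rest.
TMPDIRS=`sed -n '\|^TMPDIRS=/\{0,1\}.*$|s|TMPDIRS=||p' "$conf"`
echo "$TMPDIRS"

rm -f "$conf"
```

With the sample file, the variable receives the two directory names, separated by a space, ready to drive a for loop.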

The rest is simply a loop through the directories named, checking to see that each name is a directory. If it is a directory, the script runs a find(1) command to rm all files older than the FILEAGING days. Younger files are left alone. If there is no such directory, cleantmp mails root an error message that the directory either doesn't exist or isn't mounted.
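That loop might look something like the following sketch. It is not the SCO script: clean_dirs is a hypothetical name, the age test is assumed to be on modification time (-mtime), and the complaint goes to standard error instead of being mailed to root.

```shell
# clean_dirs DAYS DIR...
clean_dirs() {
    days=$1
    shift
    for dir in "$@"
    do
        if [ -d "$dir" ]
        then
            # Remove regular files last modified more than $days days ago.
            find "$dir" -type f -mtime +"$days" -exec rm -f {} \;
        else
            echo "cleantmp: $dir does not exist or is not mounted" >&2
        fi
    done
}
```

Something like "clean_dirs $FILEAGING $TMPDIRS" would tie this to the two values read from /etc/default/cleantmp.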

Cleanup

Another general purpose cleaning program SCO provides is /etc/cleanup. Cron runs this script only on Sundays at 5:17 AM. Here's the crontab entry:

17 5 * * 0 /etc/cleanup > /dev/null

The backup runs on Tuesday through Saturday after logchecker, cleantmp, and other cleaning jobs have finished. Saturday and Sunday, the company is closed, so there isn't any need for backup on those days. Because the system is quiet on Sunday, this is a good time to run other cleaning jobs.

Cleanup processes several files: /usr/adm/sulog, /etc/log/filesave.log, /etc/wtmp, and core files. su(1) maintains the sulog file. An /etc/default/su file contains several settings controlling how su works. Of those, the SULOG variable gives the full pathname for the file that logs every use of the su command. If SULOG has no entry, no sulog file is kept. The standard place for it is /usr/adm/sulog. Cleanup simply copies sulog to Osulog, then reinitializes sulog using the command:

> /usr/adm/sulog

This method keeps the original permissions, ownership, and group.

volcopy(1) updates the filesave.log file. volcopy copies the entire UNIX system from one device to another, like-sized device. Logs of its activities go to /etc/log/filesave.log. Cleanup takes a little extra effort in dealing with filesave.log. First, it tests whether the file exists. If so, it moves -- not copies -- the file to Ofilesave.log. Then, cleanup recreates the file using the command:

> /etc/log/filesave.log

Unlike the sulog file, this file ceases to exist. Recreating it this way gives the file root's permissions, ownership, and group. Cleanup must execute three more commands to fix this: chown root, chgrp sys, and chmod 666.

The wtmp file (look up utmp(4)) holds user facts, including the login name, login device, login pid, and login time. These facts are accumulated for each login and are used by several programs, including who(1). This file can get very large. The cleanup script doesn't try to preserve older versions. It simply eliminates the old version with the command:

> /etc/wtmp

As before, this simple command clears the data but keeps the permissions, owner, and group settings.
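The difference between truncating in place and moving then recreating is easy to see in a scratch directory. The sketch below uses made-up file contents and modes; only the file names echo the ones cleanup handles.

```shell
demo=${TMPDIR:-/tmp}/agedemo.$$
mkdir "$demo" && cd "$demo" || exit 1
umask 022

# sulog style: copy, then truncate in place. The file itself survives,
# so its mode, owner, and group are untouched.
echo data > sulog
chmod 600 sulog
cp sulog Osulog
> sulog
kept=`ls -l sulog | cut -c1-10`

# filesave.log style: move, then recreate. The new file is built from
# the umask and the invoking user, so the mode must be put back.
echo data > filesave.log
chmod 666 filesave.log
mv filesave.log Ofilesave.log
> filesave.log
fresh=`ls -l filesave.log | cut -c1-10`
chmod 666 filesave.log

echo "truncated in place:  $kept"
echo "moved and recreated: $fresh"

cd / && rm -rf "$demo"
```

The first mode printed is still -rw-------, while the second comes out as -rw-r--r-- until the chmod restores it.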

Finally, cleanup uses find to look throughout the file system for any core file with an access time older than seven days. Presumably, nobody would still be debugging such old core files. If find locates any such files, it removes them.
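A find command of that kind could be written as in this sketch. The wrapper name and its starting-directory argument are assumptions; the cleanup script's own command line is not reproduced here.

```shell
# remove_old_cores TOPDIR
# Delete regular files named "core" not accessed in over seven days.
remove_old_cores() {
    find "$1" -type f -name core -atime +7 -exec rm -f {} \;
}
```

Running it on a scratch directory first, with -print in place of the -exec clause, is a cheap way to see what it would remove.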

The cleanup script is an excellent place to put any special cleaning tasks you might have. For example, the /usr/adm/messages file is a top candidate for pruning. This file receives all system error messages, including bootstrap messages. dmesg(1) shows the error messages and appends them to the /usr/adm/messages file. Alternatively, /dev/error, a read-only device, holds these messages until they're read. The /etc/rc2 script usually runs dmesg or reads from /dev/error whenever the system is booted to init(1) level 2. Root's crontab may contain a dmesg command to periodically dump the error messages to /usr/adm/messages.

The messages file is great for discovering all kinds of nasty problems. Check it often. However, this file can grow to immense proportions if not pruned periodically. I added the following commands to the cleanup script to prune messages:

tail +`expr \`wc -l </usr/adm/messages\` - 1000` /usr/adm/messages >/tmp/msgs

mv /tmp/msgs /usr/adm/messages
chmod 644 /usr/adm/messages
chgrp bin /usr/adm/messages
chown bin /usr/adm/messages

The tail(1) command is the tricky part. I want to keep the most recent 1000 lines in the file for system debugging. Because new messages are appended to the end of the file, you'd think I could just use tail -1000 on the file. No such luck.

It turns out that tail's -arg has an internal 4096-byte limit. While that may not be so bad for most uses of tail, it prevented me from pruning this file the easy way. Tail also has a +arg that says "take the end of the file starting from the +arg line number." That'll do the trick if I just figure out what line number that is.

The wc(1) program counts the lines in the messages file. Because the filename is output along with the count when the file is named on the command line, I deliver the data through redirection, for which no name is known or printed. Then, I pass the line count to expr(1), which subtracts 1000. That delivers the line number for tail to start with. I place the last 1000 lines into a temporary file, overwrite the existing messages file with the temporary, and finish by adjusting the permissions, owner, and group.
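The same calculation can be wrapped in a function for use on other files. This is a sketch with a few departures that should be treated as assumptions: prune_tail is a made-up name, the starting line is spelled "tail -n +N" as current systems expect, one is added so that exactly the requested number of lines remains, and files that are already short are left alone.

```shell
# prune_tail FILE KEEP
prune_tail() {
    file=$1
    keep=$2
    total=`wc -l < "$file"`
    [ $total -gt $keep ] || return 0       # already short enough
    start=`expr $total - $keep + 1`        # first line to keep
    tmp="$file.tmp.$$"
    # cat writes into the existing file, so its permissions, owner,
    # and group stay as they were.
    tail -n +"$start" "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Because the original file is overwritten rather than replaced, the chmod, chgrp, and chown steps are not needed with this variant.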

Purging Old History Files

Some applications create special files that aren't much use after some time elapses, yet they hang around on the system taking up precious space. For example, for one application I wrote software that produces work orders. Every work order is sent to the printer after all of the data is assembled. The software allows the client to reprint past work orders, so the work orders are actually produced into a file named after the work order invoice number, and then sent to the printer. All of those files collect in a specific directory, called REPRINT_DIR. Reprinting is only a matter of sending the file to the printer again.

After a while, many old work orders have accumulated on the system, cluttering up the REPRINT_DIR. Only 140 of the files appear on the software's screen. Why bother keeping more than 140? So, I wrote a script to clean them up. The primary code looked like this:

cd $REPRINT_DIR
if [ `ls | wc -w` -gt 140 ]
then
find . -type f -mtime +14 -print |
xargs rm
fi

The test checks whether there are more than 140 files. Filenames output by a simple ls can be counted by wc as one word per filename. If that count is greater than 140, the code kills the oldest ones. At first, I decided that any file older than 14 days was probably useless. The names of these files were passed to xargs(1), which invoked rm for all of them at once. (This method is more efficient than using find's -exec option, which would execute rm once for each file instead of once for all of the files.) However, a beta site showed that this method wasn't a good idea.

It turned out that one of the beta test sites generated so many work orders in a day that they'd run over the 140-workorder limit in two or three days. By 14 days, there were more files than anyone cared about. At that point, we decided to keep only the most recent files. There actually was no need to reprint the work orders beyond a couple of days of the initial work order anyway. I replaced the find command line with the following:

ls -t | tail +140 | xargs rm

The -t option delivers a time sort with the most recent files first. Because I didn't use -l, only the names appear in a single-column list. Tail cuts that list off starting at line 140, which is the 140th filename. It delivers all of the names at the end of the list to xargs, which removes those filenames. The first 139 names remain, which are the 139 most recent work orders.
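A parameterized form of the same pipeline is sketched below. keep_newest is a hypothetical name, the starting line is written "tail -n +N" for current systems, and the filenames are assumed to contain no whitespace (true of names built from invoice numbers).

```shell
# keep_newest DIR COUNT
# Remove all but the COUNT most recently modified files in DIR.
keep_newest() {
    first=`expr $2 + 1`                    # first line to delete
    # The subshell keeps the cd from affecting the caller.
    ( cd "$1" && ls -t | tail -n +"$first" | xargs rm -f )
}
```

Starting the deletions at line COUNT + 1 leaves exactly COUNT files, so "keep_newest $REPRINT_DIR 140" would keep 140 work orders where the line above keeps 139.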

I put the script that purges the old work order files into the crontab to run once every day. That way, the previous day's files push off the equivalent number of older files.

General Purpose File Aging

All of the techniques shown so far work for many kinds of files strewn throughout the system. But these techniques require specific changes to specific files or require other special treatment, as the work order files example did. The techniques are worth knowing and can be applied in many scripts, but I wanted a script that I could use to age many files.

My objectives were to:

1. name the files that needed aging on a script's command line,

2. be assured that copies of the file would go to a known location,

3. replace older versions of the file with more recent versions,

4. maintain the original file's permissions, ownership, and group settings on the older versions and the current version.

I didn't care to truncate the original file as part of the aging process, although I could add such an enhancement if necessary. I envisioned this aging process as an online file backup for quick problem recovery. For example, you mustn't truncate database files or the password file. Yet, if a problem appears in the file, I might resolve it more quickly from an online copy of the file than from a copy found and extracted from tape. The tape is still available in the event of major catastrophes.

This method wouldn't work well for aging extremely large files because of the disk space required, but small to medium-sized files won't use enough disk space to cause a problem. You specify which files to age by naming them on the script's command line. The script doesn't care about the size of the file. The dupback program in Listing 1 implements this idea.

The aged files will be placed in /usr/local/backup. If that directory doesn't exist, the script creates it. mkdir's -p option creates any missing directories along the path. Notice the logical OR operator ("||"). If the test -d fails, the first part of the OR is false, so the second part must execute to complete the OR condition. If the test -d succeeds, which means that the directory already exists, the OR doesn't bother executing the mkdir in the second part.

The FILES variable receives a copy of every filename on the command line. These filename arguments must be saved because the set command inside the for loop will replace the positional parameters.

Inside the for loop, each filename coming from the command line is stored in a variable named f. The set command takes the output of ls -li, which shows the long listing of the current file in $f. From the long listing, the program can figure out the file's owner and group settings. I couldn't simply use ls -l, though. When the subshell is finished executing the ls -l, it delivers the output to the set command. Because this dupback program is intended to work with regular files, the permissions field begins with a hyphen ("-"), the file-type character for a regular file. Thus, the permissions, such as "-rwxrwxrwx", will appear to the set command as an option list. To prevent that, I added the -i option, which makes ls print the file's inode number. Because the inode number always appears first, it prevents the permission settings from appearing as a set option list, at the trivial cost of shoving the owner and group names into the next positional parameter. So, $4 has the owner name and $5 has the group name.
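The effect on the positional parameters can be checked with a throwaway file. This sketch is not taken from Listing 1; the scratch file name is invented for the demonstration.

```shell
f=${TMPDIR:-/tmp}/setdemo.$$
echo sample > "$f"

# The inode number comes first, so set sees an ordinary word before
# it ever reaches the "-rw-..." permissions field.
set `ls -li "$f"`
echo "inode=$1 perms=$2 links=$3 owner=$4 group=$5"
owner=$4
group=$5

rm -f "$f"
```

On current shells, "set --" followed by the ls -l output is another way to stop option parsing, in which case the owner and group would land in $3 and $4 instead.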

An inner for loop drives the aging process. I want to create a duplicate of each original file using the same name but with the suffixes .1, .2, and .3 added. When doing the next aging, the .2 file overwrites the .3 file, the .1 file overwrites the .2 file, and the current file overwrites the .1 file. To preserve the oldest files, the loop numbering goes backward from 2 to 1. The loop counter doesn't need 3 because an expr calculation in the mv command creates it.

The code first tests whether the file about to be mv'd exists, because mv gives an error message if asked to move a file that doesn't exist. This test uses the logical AND ("&&") to be sure that mv is executed if, and only if, the file exists. If the test fails, the entire AND is considered false, so mv won't execute.

Once finished, the loop has turned file.1 into file.2, and file.2 into file.3. At this stage, there is no file.1 anymore. Keep in mind that mv keeps the original permission, owner, and group settings. All that remains is to copy the original file using the file.1 style name. The gotcha is that cp creates a brand new file, which means that the permissions, owner, and group will be set according to the user executing the program.
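The rotation and the final copy can be sketched as follows. This is an illustration of the technique rather than the code in Listing 1: age_copy and its two arguments are hypothetical, and the step that restores permissions, group, and owner on the new copy is left out.

```shell
# age_copy FILE BACKUPDIR
age_copy() {
    f=$1
    dest=$2/`basename "$f"`
    # Work backward so no aged copy is overwritten before it moves:
    # .2 becomes .3 first, then .1 becomes .2.
    for i in 2 1
    do
        next=`expr $i + 1`
        test -f "$dest.$i" && mv "$dest.$i" "$dest.$next"
    done
    # cp makes a brand new .1 that belongs to whoever runs the script.
    cp "$f" "$dest.1"
}
```

After four runs against a changing file, the .1, .2, and .3 copies hold the three most recent versions and the first one has been dropped.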

After the cp, the permissions, group, and owner, in that order, are set. The original file's group and owner settings are retrieved from the ls output stored in the positional parameters. The permission settings are more difficult to acquire. I could have gotten this information from a C program, but I really wanted to do this entirely in shell script. So, I wrote an awk program to translate the ls -l permissions field into the octal format usable by chmod. Listing 2 shows the getperm script that does this translation. Because getperm was written to provide the octal permission value and the filename, getperm's output supplies everything that chmod needs.
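For readers without Listing 2 at hand, the translation itself can be sketched in a few lines of awk wrapped in a shell function. This is an independent illustration, not the getperm script: perm_octal is a made-up name, it reads ls -l lines on standard input, and it assumes filenames without whitespace.

```shell
# perm_octal: read "ls -l" lines, print the octal mode and the filename.
perm_octal() {
    awk '{
        p = $1; special = 0; mode = ""
        for (k = 0; k < 3; k++) {            # user, group, other
            t = substr(p, 2 + 3 * k, 3)
            v = 0
            if (substr(t, 1, 1) == "r") v += 4
            if (substr(t, 2, 1) == "w") v += 2
            c = substr(t, 3, 1)
            if (c == "x" || c == "s" || c == "t") v += 1
            # s, S, t, and T mark setuid, setgid, or the sticky bit.
            if (c == "s" || c == "S" || c == "t" || c == "T") {
                if (k == 0) special += 4
                else if (k == 1) special += 2
                else special += 1
            }
            mode = mode v
        }
        print special mode, $NF
    }'
}
```

Given a listing line whose first field is -rwxr-xr-x, the function prints 0755 followed by the filename, which is the pair of arguments chmod expects.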

When dupback is finished, the original file still exists in the original location. It is duplicated in /usr/local/backup/filename.1. The previous filename.1 is now named /usr/local/backup/filename.2, and the previous filename.2 is now named /usr/local/backup/filename.3.

A cron job can run dupback with a command-line list of the files to be aged. The next time something goes wrong with one of those files, you can fall back on the aged copies before resorting to the backup tape. Combining this with the other techniques for eliminating, pruning, or aging files, you should find little difficulty keeping the weeds from taking over the system.

About the Author

Larry Reznick has been programming professionally since 1978. He is currently working on systems programming in UNIX and DOS. He teaches C language courses at American River College in Sacramento. He can be reached via email at: rezbook!reznick@csusac.ecs.csus.edu.