Cover V11, I02

Listing 1
Listing 2


Increasing Reliability Through Forensic Operations

Bob Apthorpe

If you work in a large organization, you may know of a few nightmare systems. These machines provide vital services but are so laden with legacy code and historical debris that nobody seems to know precisely what they do anymore. Systems like this are the bane of every systems administrator, but can be great learning opportunities. I recently took on the challenge of deciphering and documenting one such nightmare system.

The Situation

I am responsible for a farm of seven middle-tier Web application servers known as jpdaemons. The jpdaemons tier provides various services to Web servers, such as stock and weather information, and personalized data, such as user preferences. These middle-tier machines broker connections between the front-end Web servers (FEWS) and back-end database servers. All the machines run Solaris 2.6 and have a shared non-root role account for running daemons accessible by all developers.

The Immediate Problems

These machines suffered from many problems, the most pressing of which were the following:

  • The middle-tier machines were running out of resources (disk, swap, CPU).
  • There was no definitive, public list of the daemons running on the middle-tier servers.
  • All the machines should have been identical but weren't.
  • There were no troubleshooting documents, nor were there start/stop scripts under /etc/rc3.d/ to restart daemons in case of reboot.

These factors combined to make maintenance and troubleshooting extremely difficult.

Operational Constraints

Besides these problems, I faced some significant operational constraints:

  • The systems are occasionally heavily loaded, so resource-intensive tools could not run continuously.
  • User access could not be easily revoked.
  • The machines had to be left in service during diagnosis, so offending processes could not be arbitrarily halted.
  • There was no budget for commercial support or tools to solve this problem.
  • There were substantial communications barriers between the three affected offices (my office in Austin, TX, our corporate data center in Redwood City, CA, and the engineering office in Tokyo, Japan).

Suffice to say, it was not feasible for everyone to sit down at the same table and meet to solve the problems.

I recognized that this project was going to be a challenge. The underlying causes of the problems were social and political, not technical, and the solution would take time to implement. Sadly, understanding the managerial problems didn't absolve me of administrative responsibilities. To preserve my sanity and health, I needed to find out what these systems did and write troubleshooting documentation for the junior admins on our NOC staff. Also, I needed a list of specific problems to convince engineering management to fix their change control process and to gain my management's support in case I needed to take more drastic actions to preserve system reliability.

Step 1: Finding the Logs

My immediate concern was stopping the daemons' logs from filling the machines' disks. When this happened, the NOC would investigate and page me after realizing they didn't know which logs could be deleted. My first task was to identify rapidly growing files in a filesystem.

Luckily, I was familiar with FCheck ( from Sys Admin ("FCheck: A Solution to Host-Based Intrusion Detection" by Ron McCarty, December 2000, FCheck is normally used as a file integrity checker as part of a host intrusion detection system. In this case, it made a perfect developer detector. I configured FCheck to look at likely directories, took a baseline snapshot, waited a few hours, then took another snapshot and compared the two. Configuring FCheck on one machine took some time, but once that was done I could quickly install it on the other six machines.

After I found the offending logs, it was a simple matter to get them under control. I wrote a simple script to compress, truncate, and delete the logs and I updated FCheck to ignore known log files (Listing 1).

I assumed that the developers were writing logs for a reason, so I didn't just delete them outright unless there was no other option (e.g., the Oracle logs.) The fuser command also helped determine which user and process was writing to a particular log. Also, because I had FCheck installed and tuned, I could automate it to watch for other changes to these systems while I continued my analysis.

Step 2: Associating Daemons with Ports

Next, I found all the running daemons and determined the ports used by each daemon. I was lucky because all the daemons were running under a single developer account (excited) and not as root, though my analytical method wouldn't have changed substantially even if the daemons had been running as root. I created a concise list of processes running under developer accounts:

# /bin/ps -u excited -o args | /bin/sort -u | /bin/egrep 
-v COMMAND ../../bin/rel5/univdbd -port 90017 -connect 
zzzzzzzzzz/zzzzzzzzzz@prel -numchil ../../bin/rel5/univdbd 
-port 90019 -connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil 
../../bin/rel5/univdbd -port 90023 -connect vvvvv/vvvvvvv@oesjp 
-numchildren 10 ../../bin/rel5/univdbd -port 90024 -connect 
zzzzzzzzzz/zzzzzzzzzz@prel -numchil ../../bin/rel5/univdbd 
-port 90049 -connect vvvvv/vvvvvvv@oesjp -numchildren 10 
../../bin/rel8/univdbd -port 14001 -connect zzzzzzzzzz/zzzzzzzzzz@prel 
-numchil ../../bin/rel8/univdbd -port 14002 -connect wwwwwwww/wwwwwwww@prel 
-numchildren ../../bin/rel8/univdbd -port 14003 -connect vvvvv/vvvvvvv@oesjp 
-numchildren 10 ../../bin/rel8/univdbd -port 14004 -connect vvvvv/vvvvvvv@oesjp 
-numchildren 10 ../../bin/rel8/univdbd -port 14005 -connect 
zzzzzzzzzz/zzzzzzzzzz@prel -numchil ../../bin/rel8/univdbd -port 14006 
-connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil ../../bin/rel8/univdbd 
-port 14007 -connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil ../../bin/rel8/univdbd 
-port 60057 -connect wwwwwwww/wwwwwwww@testprel -numchil ../../bin/rel8/univdbd 
-port 90054 -connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil ./nerd -kids 0 
-port 80020 -qprefix real-jp -log ./real-jp.log -qcount 1 
./phd_jp -port 8111 -cwd /excite/daemon/phd_jp -children 30 -loglevel 2 
-atime /usr/local/bin/perl -T ./
RELEASES/release/sol24/acctd -port 60005 -connect xxxx/ 
-n RELEASES/release/sol24/authd -port 60003 -connect xxxx/ 
-n RELEASES/release/sol24/infod -port 60006 -connect xxxx/ 
-n sol24/jpminibbs -port 60222 -connect yyyyyyy/yyyyyyy@jpurs -numchildren 10 
-log sol24/jpnewtransd -port 60797 -numchildren 20 -loglevel 10 -logfile 
./logtrans/ sol24/jpnewtransd -port 60798 -numchildren 17 -loglevel 10 
-logfile ./logtrans/ sol24/jpnewtransd -port 60799 -numchildren 20 
-loglevel 10 -logfile ./logtrans/ sol24/jpursplusd -port 60680 -connect 
uuuuuuuuu/uuuuuu@jpurs -numchildren 10 -l
Note that under Solaris 2.6, the default (System V) version of ps truncates the argument list. Berkeley ps (/usr/ucb/ps) yields more information at the expense of format configurability. To generate the same results as above, I used:

# /usr/ucb/ps auxww | /bin/egrep '^ *excited' | /bin/cut -c62- | /bin/sort -u
Field width is variable, so the arguments to /bin/cut need to be adjusted accordingly.

Next, I listed the services listening on the host:

# /bin/netstat -na | /bin/egrep LISTEN | /bin/sort -n -b -k 1.3,1.7  
      *.22                 *.*                0      0     0      0 LISTEN
      *.25                 *.*                0      0     0      0 LISTEN
      *.111                *.*                0      0     0      0 LISTEN
      *.4045               *.*                0      0     0      0 LISTEN
      *.8111               *.*                0      0     0      0 LISTEN
      *.13722              *.*                0      0     0      0 LISTEN
      *.13782              *.*                0      0     0      0 LISTEN
      *.13783              *.*                0      0     0      0 LISTEN
      *.14001              *.*                0      0     0      0 LISTEN
      *.14002              *.*                0      0     0      0 LISTEN
      *.14003              *.*                0      0     0      0 LISTEN
      *.14004              *.*                0      0     0      0 LISTEN
      *.14005              *.*                0      0     0      0 LISTEN
      *.14006              *.*                0      0     0      0 LISTEN
      *.14007              *.*                0      0     0      0 LISTEN
      *.14484              *.*                0      0     0      0 LISTEN
      *.24481              *.*                0      0     0      0 LISTEN
      *.24483              *.*                0      0     0      0 LISTEN
      *.24487              *.*                0      0     0      0 LISTEN
      *.24488              *.*                0      0     0      0 LISTEN
      *.24513              *.*                0      0     0      0 LISTEN
      *.24518              *.*                0      0     0      0 LISTEN
      *.32771              *.*                0      0     0      0 LISTEN
      *.40001              *.*                0      0     0      0 LISTEN
      *.60003              *.*                0      0     0      0 LISTEN
      *.60005              *.*                0      0     0      0 LISTEN
      *.60006              *.*                0      0     0      0 LISTEN
      *.60057              *.*                0      0     0      0 LISTEN
      *.60222              *.*                0      0     0      0 LISTEN
      *.60680              *.*                0      0     0      0 LISTEN
      *.60797              *.*                0      0     0      0 LISTEN
      *.60798              *.*                0      0     0      0 LISTEN
      *.60799              *.*                0      0     0      0 LISTEN
As with /bin/cut, you will need to adjust the arguments to /bin/sort.

At this point, I had a list of running daemons, a list of which ports had daemons listening on them, and I could regenerate those lists at will. This allowed me to take a snapshot of the running system and compare it to the the baseline snapshot at a later time, much as I had done with FCheck and the filesystem.

A Curious Observation

I removed well-known services (ssh, smtp, sunrpc/portmapper) from the netstat output (see RFC 1340). I initially trusted ps to identify which ports each daemon used, until I hit the following curious entry:

/../bin/rel8/univdbd -port 90054 -connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil
Since port number is defined in RFC 973 "" as a 16-bit unsigned integer, it can't exceed 65535. So, if a process is told to listen to a port greater than 65535, what does it do? Apparently, the process doesn't die or it wouldn't show up under ps. This is when I began using lsof to confirm associations between ports and daemons.

First I retrieved a process ID:

/bin/ps -u excited -o pid,args | /bin/egrep 90054 | /bin/sort -b -n | /bin/head -1
1905 ../../bin/rel8/univdbd -port 90054 -connect zzzzzzzzzz/zzzzzzzzzz@prel -numchil
I fed that into lsof to find out which files and ports the process was using:

# lsof -p 1905
univdbd 1905 excited  cwd   VDIR      32,10       512     71421   \
univdbd 1905 excited  txt   VREG      32,10   2264140     71233   \
univdbd 1905 excited  txt   VREG       32,0   6795316    359900   \
univdbd 1905 excited  txt   VREG        0,1     70168 205389403 /tmp (swap)
univdbd 1905 excited  txt   VREG       32,0     16932    173297   \
univdbd 1905 excited  txt   VREG       32,0   1024888     88720   \
univdbd 1905 excited  txt   VREG       32,0     19304     88302   \
univdbd 1905 excited  txt   VREG       32,0    105788     88347   \
univdbd 1905 excited  txt   VREG       32,0    343432     88541 /   \
univdbd 1905 excited  txt   VREG       32,0    727712     88530   \
univdbd 1905 excited  txt   VREG       32,0     36508     88533   \
univdbd 1905 excited  txt   VREG        0,1         4 205389377 /tmp (swap)
univdbd 1905 excited  txt   VREG       32,0     33588     88305   \
univdbd 1905 excited  txt   VREG       32,0      6708     88320   \
univdbd 1905 excited  txt   VREG       32,0     53656     88314   \
univdbd 1905 excited  txt   VREG       32,0      4304     88632   \
univdbd 1905 excited  txt   VREG       32,0    181840     88304   \
univdbd 1905 excited    0r  VCHR       13,2       0t0    268012   \
univdbd 1905 excited    1w  VCHR        0,0   0t49177    268004   \
univdbd 1905 excited    2u  VCHR        0,0 0t4008043    268004   \
univdbd 1905 excited    3w  VREG       32,3     64114     30534 /logs   \
univdbd 1905 excited    4u  inet 0x6201e410       0t0       TCP *:24518 (LISTEN)
univdbd 1905 excited    5u  FIFO 0x61da6264       0t4      1379 PIPE->0x61da61e0
univdbd 1905 excited    6u  FIFO 0x62121904       0t4      1380 PIPE->0x62121880
univdbd 1905 excited    7u  FIFO 0x62121b84       0t4      1381 PIPE->0x62121b00
univdbd 1905 excited    8u  FIFO 0x621217c4       0t4      1382 PIPE->0x62121740
univdbd 1905 excited    9u  FIFO 0x62121684       0t4      1383 PIPE->0x62121600
univdbd 1905 excited   10u  FIFO 0x62121544       0t4      1384 PIPE->0x621214c0
univdbd 1905 excited   11u  FIFO 0x62121404       0t4      1385 PIPE->0x62121380
univdbd 1905 excited   12u  FIFO 0x62121a44       0t4      1386 PIPE->0x621219c0
univdbd 1905 excited   13u  FIFO 0x61b76244       0t4      1387 PIPE->0x61b761c0
univdbd 1905 excited   14u  FIFO 0x621212c4       0t4      1388 PIPE->0x62121240
univdbd 1905 excited   15u  FIFO 0x62121184       0t4      1389 PIPE->0x62121100
Some important lines are:

univdbd 1905 excited  cwd   VDIR      32,10       512     71421   \
univdbd 1905 excited  txt   VREG      32,10   2264140     71233   \
univdbd 1905 excited    3w  VREG       32,3     64114     30534 /logs   \
univdbd 1905 excited    4u  inet 0x6201e410       0t0       TCP *:24518 (LISTEN)
Line 1 shows that this process was run from /excite/daemon/jp/udbd/sct/travel. Line 2 shows the full path to the executable (/excite/daemon/jp/udbd/bin/rel8/univdbd). Line 3 shows the process is writing to something in /logs/. Line 4 shows the process is listening on TCP port 24518, not port 90054 as the results from ps indicate. Where does 24518 come from? Simple -- truncate 90054 to 16 bits by dividing it by 2**16 (65536) and taking the remainder:

# echo '90054 % 65536' | /bin/bc
This was an unexpected discovery that greatly simplified the process of documenting the daemons and turned the analysis into a fairly tedious exercise. The security implications here should not be underestimated -- can a non-privileged user set code to listen on a privileged port by specifying a port number greater than 65535? This is yet another reason to view setuid scripts with suspicion.

Step 3: Building the Start Scripts

In my case, after I found where executables were stored, I usually found start scripts nearby. It was simple to write a wrapper script and put it in /etc/init.d and symlink it into /etc/rc3.d (Listing 2).

Note the use of /bin/logger in Listing 2 to report daemon activity to syslog and the use of /bin/tty to selectively send output to the console. Also note that unrecognized directives (e.g., anything other than "stop", "start", or "restart") yield process and network status information.

I added chkconfig headers to these startup scripts to simplify deployment on IRIX and Linux systems. They cause no harm under Solaris, and they leave a terse documentation trail for anyone that comes later.

If you are not so fortunate as to find a startup script during your investigations, use the "e" flag to Berkeley ps to show environment information:

# /usr/ucb/ps eauxww | /bin/egrep 1905 | /bin/egrep -v grep
excited   1905  0.0  0.1 8024  488 ?        S   Jul 21  1:44 
../../bin/rel8/univdbd -port 90054 -connect zzzzzzzzzz/zzzzzzzzzz@prel 
-numchildren 10 -logall -loglevel 1 
-logfile /logs/udbd/travel_univdbd.ed.log GROUP=excite HOME=/u/excited 
HOST=caligula HOSTTYPE=sun4 HZ= LD_LIBRARY_PATH=/u/oracle/product/8.0.5/lib: 
LOGNAME=excited MACHTYPE=sparc NLS_LANG=Japanese_japan.JA16SJIS 
ORACLE_HOME=/u/oracle/product/8.0.5 OSTYPE=solaris 
PATH=/usr/sbin:/usr/bin:/usr/local/bin:/usr/ccs/bin:/usr/local/news/bin:/bin:/ \
PWD=/excite/daemon/jp/udbd/sct/travel SHELL=/usr/local/bin/tcsh SHLVL=1 
TNS_ADMIN=/var/opt/oracle TZ=US/Pacific USER=excited VENDOR=sun
This won't create your startup scripts for you, but it does show you an environment that the daemon will run under.

Moving Targets

It took several weeks to identify and catalog all the ports and daemons and to create start scripts. During the project, I sent progress reports and technical data to my group, my department, and the developers in Tokyo. I also explicitly told everyone to inform me before making any changes to these machines, because I was in the process of documenting the existing system. I did not want to spend a lot of time and effort documenting the system only to have it change, invalidating everything I had done. That, of course, is precisely what happened.

However, two things saved me. First, I put FCheck in place to watch for file changes. When seven new daemons appeared on the machines, I knew about it in a few hours -- my developer detection system worked perfectly! Second, I added all the daemons to /etc/services so I could use netstat to detect new daemons. By using netstat without the "-n" option, known daemons would show up with their proper names while new daemons would only appear with a numeric port number:

# /bin/netstat -a | /bin/egrep LISTEN | /bin/sort -b -k 1.3,1.7
      *.14001              *.*                0      0     0      0 LISTEN
      *.14002              *.*                0      0     0      0 LISTEN
      *.14003              *.*                0      0     0      0 LISTEN
      *.14004              *.*                0      0     0      0 LISTEN
      *.14005              *.*                0      0     0      0 LISTEN
      *.14006              *.*                0      0     0      0 LISTEN
      *.14007              *.*                0      0     0      0 LISTEN
      *.acctd              *.*                0      0     0      0 LISTEN
      *.authd              *.*                0      0     0      0 LISTEN
      *.bpcd               *.*                0      0     0      0 LISTEN
      *.bpjava-msvc        *.*                0      0     0      0 LISTEN
      *.chanceit           *.*                0      0     0      0 LISTEN
      *.infod              *.*                0      0     0      0 LISTEN
      *.jpaddrbookd        *.*                0      0     0      0 LISTEN
      *.jpencode           *.*                0      0     0      0 LISTEN
      *.jpnewtransd        *.*                0      0     0      0 LISTEN
      *.jpnewtransd2       *.*                0      0     0      0 LISTEN
      *.jpnewtransd3       *.*                0      0     0      0 LISTEN
      *.jpursplusd         *.*                0      0     0      0 LISTEN
      *.lockd              *.*                0      0     0      0 LISTEN
      *.music_test2        *.*                0      0     0      0 LISTEN
      *.newsedit           *.*                0      0     0      0 LISTEN
      *.nphoto             *.*                0      0     0      0 LISTEN
      *.phd_jp             *.*                0      0     0      0 LISTEN
      *.real               *.*                0      0     0      0 LISTEN
      *.recjnew            *.*                0      0     0      0 LISTEN
      *.simpfeed           *.*                0      0     0      0 LISTEN
      *.smtp               *.*                0      0     0      0 LISTEN
      *.ssh                *.*                0      0     0      0 LISTEN
      *.statd              *.*                0      0     0      0 LISTEN
      *.sunrpc             *.*                0      0     0      0 LISTEN
      *.travel2            *.*                0      0     0      0 LISTEN
      *.vopied             *.*                0      0     0      0 LISTEN
Note the top seven entries. Assuming the developers have not changed ports at random, I only have to investigate and document seven more daemons by the process outlined so far. Also, I haven't lost any of the work I've done to this point. The systems administrators I most respect usually have two notable characteristics -- they hate repetitive work and they hate wasted effort, so I tried to follow their examples and work defensively.

From Damage Control to Efficiency

After the existing daemons were documented and startup scripts were in place, I started monitoring daemon usage. I originally assumed that every daemon running on the systems was actually in use. Since I had solved the problem of logs filling the disks, the major problem was with memory and swap consumption. While I could add memory or swap, it might be easier to turn off any unused daemons. The key was finding which daemons were in use.

Initially, I used netstat and lsof to watch for connections. This method didn't work very well because the connections were transient; I needed a tool to log network activity over time. On a coworker's recommendation, I chose ntop (

/usr/local/bin/ntop -S 1 -t 0 -p /var/adm/z/ntop.protocols.txt \
-P /var/adm/z -d

The ntop GUI normally runs on port 3000. I was primarily concerned with incoming TCP traffic on a machine called caligula, so the results were located at:
Ntop recorded traffic volume for the daemons in which I was interested and clearly indicated all the unused daemons.

Social Engineering

While I was happy that I had solved the immediate technical problems, I knew the underlying software deployment process was still broken. Unless Engineering notified Operations prior to launching new daemons, we would continue to have problems with the jpdaemons service. Using the information I had gathered about system usage and hardware capacity, I explained to Engineering and Operations management that the service was unable to be maintained, and I proposed a number of changes to our launch process. I also explained that I would turn off any new daemons I found unless someone sent me troubleshooting information so I could write start and stop scripts and update /etc/services.


It took about nine months to clean up this system and it's still not finished. I initially reported the problem via established channels and, as one might expect, nothing happened. This is not surprising because I've faced this problem at every organization I've worked for. My goal was to solve the root cause of the problem -- the daemon launch and upgrade process.

I considered the history of this system and our organization: international developers had received less support than their domestic counterparts; production standards and processes were vague at best because of the company's explosive growth and the newness of our technology and management. Also, the implied standards we did have had not been effectively communicated to overseas developers. Added to this were cultural, language, organizational, and temporal differences. As bad as the system might seem at present, it's easy to imagine it devolving into chaos, despite our good intentions.

With that as background, I presented my recommendations:

  • Temporarily disable the "excited" account until the service was rationalized, and the daemon launch process was fixed.
  • Notify Operations before launching a new daemon. Necessary information is, for example:

Service Name -- jpencode

Purpose -- Convert Shift-JIS text to Unicode

Start script -- /excite/daemon/jpencode/

Stop script -- /excite/daemon/jpencode/ Daemon -- /excite/daemon/jpencode/

Port -- 40001

User -- excited

Monitoring -- Check whether daemon answers on port 40001

  • Replace jpdaemons service in site configuration management database with individual services for each of the 30+ daemons found.
  • Rationalize and load balance existing services and make machines as similar as possible.
  • Find an effective liaison between Operations and Tokyo Engineering.
  • Add all daemons to /etc/services.
  • Fix daemons listening on irrational ports (> 65535).

At the end of this project, I felt as though I had won all the battles and lost the war. Still, I don't feel too bad about the project. I made the system usable and maintainable, and I solved every technical problem I faced. I clearly communicated my findings and suggested ways to fix the launch process. Ultimately, I reduced a technical problem to a social problem and left it as an exercise for management. As systems administrators, that's often the best we can do.


chkconfig -- \


FCheck --

lsof --

ntop --

Bob Apthorpe is a Sr. System Administrator with Excite@Home, Inc. Prior to joining Excite, he was a risk and reliability analyst in the nuclear power industry. Off the clock, he is sometimes found performing improvisational comedy in Austin, Texas. He can be contacted at: