Slow: A Tool for Reporting and Diagnosing Performance Problems
Darrell Root and Belinda Liviero
Many performance problems are transitory in nature. Often by the time a user reports a performance problem to the helpdesk, the helpdesk assigns the report to a sys admin, and he or she finds time to investigate the issue, the problem will have subsided. The administrator then has little choice but to close the ticket saying "open another ticket if the problem happens again". This results in frustration for both the administrator and the user: they both want the problem to be solved, so it will not happen again; but the administrator does not have enough data to debug the problem.
The underlying issue is the lack of data. In this article, we introduce a tool for customers to use to report performance problems. While reporting the problem, Slow takes a quick snapshot of data to enable better diagnosis of the problem. Using the Slow data snapshot, administrators have a better chance of identifying the cause of reported performance problems. For example, administrators at two large sites have been able to reach a 60% or higher success rate in root-cause identification of slowness problems using this tool. Trend (pareto) analysis of reported performance issues will also assist administrators in identifying the most important problems to focus on, and will provide historical data that can be used to justify system or network upgrade purchases.
Overview of Slow
The full source code of Slow is available at:
Slow is currently ported to Solaris, SunOS, Irix, AIX, HPUX, and BSDI. It is easily ported to different operating systems, and can also be extended to include site-specific performance debugging tools (e.g., sites using AFS may choose to include commands within Slow to take a snapshot of AFS server performance).
The Slow tool begins by asking the customer to input a one-line description of the problem, and a number (1-10) to rate the problem's severity. Then, based on the operating system on which Slow is run, various diagnostic commands are run to gather each type of data (e.g., on SunOS, Slow uses iostat to get disk IO information; on IRIX, Slow uses sar -d). Diagnostic commands that keep totals since the last reboot, such as netstat, are run twice for a better view of the current situation. The Slow output is emailed to a central email list. This can also be set up to trigger automatic creation of a trouble-ticket. Finally, Slow prints a message informing customers of exactly what level of service they can expect on this trouble report (more on the importance of this later).
Figure 1 shows Slow's selection of diagnostic commands for IRIX. Figure 2 shows some of those commands being executed and the assorted output files being concatenated into one file. Each command run by Slow will place output in its own temporary file; this enables flexibility in the final ordering of the problem report and eases data analysis. For example, Slow executes two sets of netstat -in (once at the beginning of the program, and once at the end). The final report places the output from these two runs together, to allow the sys admin to clearly see any differences in the output of the two commands.
In this article, we will describe the various data-gathering commands used within Slow, and provide suggestions on how the various data can be interpreted.
Using netstat to Identify Network Problems
netstat -in reports the number of network packets, collisions, and input/output packet errors since the last time the machine was rebooted (or the last time the statistics were manually cleared). To determine the current collision and error rate, netstat is run twice (once near the beginning of the Slow run, and once near the end). The results from the first netstat are then subtracted from the second. In Figure 3, the first netstat report shows 2755010 inbound packets on the ec0 interface (we ignore the loopback interface when looking at netstat data). The second netstat report shows 2755195 inbound packets. That means we saw 185 inbound packets during the period when Slow was running (about 1 minute). Similarly, we saw 0 input errors, 102 outbound packets, 0 output errors, and 131 collisions.
Input and output packet errors should be almost zero (beware that some operating systems do not even report packet errors). The rule of thumb in interpreting this data is "the fewer collisions the better". If the ratio between collisions and outbound packets is greater than 20% (obtained by dividing the number of collisions by the number of outbound packets), this indicates a significant problem. Collision percentages greater than 100 are possible if, on average, for each outbound packet the system attempts transmission at least twice due to collisions. In this example, 131 collisions out of 102 outbound packets results in a 128% collision rate.
The real-life example provided is a clear case of a network problem. This example was from a shared 10bt Ethernet network which Multi Router Traffic Grapher (MRTG) data from a router indicated was running at 50% utilization (see http://ee-staff.ethz.ch/~oetiker/webtools/mrtg/mrtg.html). The problem was alleviated once the network hog was identified. Based on repeated performance complaints and MRTG network utilization metrics, the long-term problem was resolved once the network was upgraded to 10/100bt switched Ethernet. The repeated Slow reports formed part of the justification for purchasing the hardware upgrades.
An alternative cause of a high number of reported collisions is a duplex mismatch problem. Ethernet interfaces can be set to full-duplex, half-duplex, or autodetect. Both ends of each network connection (i.e., the host and the corresponding port on the switch) must be set to identical duplex settings to ensure reliable performance. Often, a duplex mismatch will manifest itself with high collisions when transmitting data in one direction, but with zero collisions when transmitting data in the opposite direction.
Using netstat to Detect a Network Hardware Problem
Figure 4 shows netstat output from another Slow report. We have 91 input packets, 6 input errors, 51 output packets, 0 output errors, and 0 collisions.
Input errors should be almost zero; 6 input errors out of 91 input packets is a 6.6% input error percentage, which is not acceptable. These results could be caused by a network hardware problem, such as bad cabling, a loose connection, running 100bt over cat-3 or cat-4 wire, or an Ethernet cable that is longer than recommended in the specification. Input errors could also be caused by another machine on the network with a bad network card or a bad protocol implementation.
Using nfsstat -c to Identify NFS Server and
The NFS protocol uses the remote procedure call (RPC) protocol. nfsstat -c reports RPC statistics on retransmissions, as well as NFS stats on the total number of NFS operations of each type (read, write, getattr, etc.). The number of RPC transmissions is a key piece of data: retransmission rates should be less than 1%.
Note that the numbers are reported separately for UDP and TCP. In Figure 5, we are exclusively using UDP-based NFS (connectionless RPC, as opposed to connection-oriented RPC, which uses TCP). During the 1-minute Slow runtime, the system attempted 107 RPC calls of which 28 timed out. This indicates a timeout percentage of 26%, which is far above the 1% limit.
The following are common causes of NFS retransmissions:
- Excessive server CPU load
- Excessive server disk-IO load
- Network load
- Full/half duplex mismatch problems
- Dropping IP fragments during routing
- Asymmetric routing between the workstation and fileserver
If a fileserver can't respond quickly because of excessive CPU or disk-IO load, NFS requests will time out. CPU load on the fileserver can be examined with top and sar -u; disk-IO load on the fileserver can be checked with iostat and sar -d. Enabling sar data collection on fileservers is a good way to automatically collect this data.
Network load is another possible cause of NFS retransmissions: high collision rates (detected with netstat -in as shown above) indicate high network load on your workstation's subnet. Use of the MRTG tool to monitor network utilization through the routers helps isolate this kind of problem. The cause of the NFS retransmissions in our example was high network load, which was easily confirmed with both the netstat -in data from Slow and the MRTG graphs.
Full/half duplex mismatch problems result in large quantities of NFS retransmissions. Sometimes the NFS retransmissions only affect the traffic in one direction (sequential writes may suffer retransmissions, for example, but sequential reads may be fine).
By default, NFS over UDP uses a 4k or 8k NFS packet size. Ethernet has a packet size (MTU) of 1.5k. To transmit an 8k UDP NFS packet over Ethernet requires fragmenting it into six IP packets. If any one of those six packets is dropped by a router, then the whole 8k NFS packet must be retransmitted. Not all sites suffer from this problem (it depends on the speed of your NFS server, and the speed and buffer memory of your router). One way to fight this problem is to manually set the NFS packet size to 1400. This gives you reduced NFS throughput, but is preferable to the performance degradation caused by the loss of IP fragments. Another solution is to have the network administrator increase the amount of memory available for IP fragmentation on the router.
Asymmetric routing takes place when packets sent to an NFS fileserver take a path different from that of the packets returning from the fileserver. Since most fileservers have multiple network interfaces, this type of misconfiguration is common. It can be detected by running traceroute from the workstation to the server, and from the server to the workstation. Asymmetric routing is almost always suboptimal. In typical cases, one of the asymmetric routes will include unnecessary router hops. In addition, some fileservers are optimized to respond to NFS requests on the same network interface at which they were received.
Using sar -d or iostat to Identify Disk-IO Bottlenecks
A local disk-IO bottleneck is a very common problem. Slow uses sar -d and iostat (depending on OS) to obtain disk utilization information. If both of the following are true for a particular disk, then you probably have a disk-IO problem:
- A disk has a combined wait and service time greater than 100 ms.
- The disk is more than 10% busy.
Sometimes, when you have a very small amount of IO, the "average" service times are above 100 milliseconds simply due to statistical randomness. The 10% rule above is meant to eliminate those cases from consideration.
In Figure 6, the combined wait and service times for the dks42d3 disk is 2476 milliseconds or 2.476 seconds (much greater than 100 ms). Additionally, the disk is 100% busy (more than 10%); thus, this machine has a severe disk-IO problem.
Different operating systems provide different information in their sar/iostat implementations, but the "average service time" and "average wait time" are usually available. On SunOS 4 systems, use the "milliseconds per seek" information in iostat as a substitute.
There are three ways to solve disk-IO problems:
- Reduce the amount of IO that needs to be performed.
- Move some of the data (and IO) to other disks/filesystems.
- Use a faster disk subsystem for this filesystem.
Using top to Identify Performance Problems
top is a program that prints out the CPU usage information for the largest CPU consuming processes. It also prints out other useful information, including memory usage, load averages, process nice settings, and so on. We use top to see which processes are actively using the CPU.
In Figure 7, the rc5des-387-spa process is using 96% of the CPU, and is the cause of the performance problem. The question is "should this process be using up that much CPU?" In this case, the process is supposed to be using all the CPU it can. The customer intended for this program to use all spare CPU cycles, so he set the nice value to 19. Even at this level, however, his performance will appear somewhat degraded. Even if the CPU consumption is insignificant, niced processes still use other system resources such as memory and disk-IO. If this 100% CPU process was not intended to consume vast quantities of CPU time (for example: Netscape), then it is likely to be a runaway process and should be killed.
Using uptime to Identify Performance Problems
Although Slow runs the uptime command, we do not use it to identify any specific performance problems. The amount of time since the last reboot adds perspective to our other diagnostic commands. Additionally, uptime shows us the average number of jobs in the run queue over the past 1, 5, and 15 minutes. This helps us understand if there has been a recent change in system activity.
Using vmstat to Identify Performance Problems
The vmstat command reports virtual memory information. If the system has insufficient RAM to house active processes contending for CPU, it will constantly have to scan memory to free pages and keep up with demand. The scan rate (sr) is listed in the vmstat output. If that rate is continually above 0, then the system is thrashing due to insufficient RAM for your current application mix.
Note that the first line of the vmstat output is an average since the last reboot, which is useless. You can ignore that line.
In Figure 8, the scan rate ranges from 245-895. This is far above 0 and indicates severe thrashing.
If the system is thrashing, look at the memory usage information in the top and ps snapshots to identify the big memory hogs. Three options for solving a memory shortage are:
- Reduce the memory needed by the running applications.
- Install more RAM on the system.
- Use an alternate machine with more RAM to run the memory-intensive application (if the need for additional memory is temporary).
If two large processes running simultaneously on a system cause thrashing, the problem can sometimes be solved by running only one of the large processes at a time. Using a simple batch job queue can be very effective in these cases.
Often, the owner of the process is unaware of its large memory consumption. Simply informing the owner may enable him or her to make simple code and configuration changes to drastically decrease the process' memory footprint.
Using sar -u to Identify Performance Problems
sar -u shows the current CPU utilization broken down into user mode, system mode, waiting-on-IO, and idle. Under some operating systems, the waiting-on-IO time is further broken down into the type of event it is waiting on.
Figure 9 shows sar -u output from a Solaris system. In this case, the machine is running 100% in user mode. This may be good (if the program is doing what the user intended) or may be bad (if the program is a runaway). Use top data to identify the busy process and contact its owner to determine if it is useful or a runaway.
Figure 10 shows more detailed sar -u output from an Irix system. The first five columns (from "%usr" to "%idle") show a breakdown of the CPU state. The last six columns (from "%sbrk" to "%wfif") provide a more detailed breakdown of the waiting-on-IO time only. This data only matters if the "%wio" time is significantly high.
In this example, the "%wio" time ranges from 59% to 89%, which is undesirable. The last six columns show the system waiting exclusively on filesystem IO; sar -d or iostat data will assist in identifying the busy disk. To correct this problem, the options are:
Using ping to Identify Performance Problems
- Attempt to reduce the IO usage.
- Increase the disk subsystem's performance.
The round-trip time for network packet exchanges between workstations and fileservers should as low as possible. The information provided by netstat is useful for detecting network congestion problems on a locally attached net. However, if your system routes to your fileservers, you may suffer performance problems because of congestion on non-local nets. netstat may also fail to provide meaningful data if your underlying network topology is based on switched networks.
Slow includes the ability to ping a "critical server" (via the use of the $PINGHOST variable) and will report data on the round-trip time. This data will help you identify bottlenecks on any network between the workstation and that particular fileserver. Ideally, ping round-trip times should be less than 20 ms (or 2 ms, for best NFS service).
In Figure 11, the ping times range from 24 ms to 388 ms (indicating high variability and very high ping times). High ping times are usually caused by network congestion, although incredibly high CPU loads on the PINGHOST can also contribute to such high values. Check the CPU load on your PINGHOST, and use a network monitoring tool to check network utilization. MRTG data collection is useful in debugging this problem.
Using the Hardware Inventory
Slow uses prtdiag (on Solaris) and lscfg (on AIX) to determine the hardware configuration of the system. This information is useful to determine whether the wrong type of hardware has been allocated for a customer or group's use.
The process table information provided by Slow is useful when one is attempting to find the cause of system thrashing (as reported by vmstat). This detailed data will provide means for identifying the process(es) consuming RAM.
In Figure 12, the alphaport process has an RSS (size resident in memory) of almost 215MB. The process' total size (SZ) is almost 466MB. vmstat data shows that this machine is thrashing. Hardware inventory data shows that this machine has 256MB of RAM. Based on this data, we can infer that this alphaport process is the cause of the thrashing.
Putting the Debugging Together
We now have several tools to identify performance problems. Network problems can be detected with netstat, nfsstat, and ping. Runaway processes can be detected with top, sar -u, and ps. Disk IO problems can be detected with iostat and sar -d. NFS fileserver problems can be detected on the clients with nfsstat. The same (or similar) toolset can be used to troubleshoot performance problem on the servers.
Setting Customer Expectation Levels
Figure 13 shows a message that is printed by Slow whenever a customer uses this tool to report a performance problem. It is very important to decide on and communicate the level of service you will provide to your customers on reported performance problems. By doing this, you won't overcommit your organization or disappoint your customers.
At our site, we have chosen to commit to a low level of service (we don't even promise to respond to the customer at all). In reality, however, we respond individually to over 90% of the Slow reports.
Pareto Analyses, Metrics, and Engineering
Troubleshooting performance problems using Slow is an example of firefighting. Simply identifying the immediate problem and applying a quick fix is not satisfactory. Your goal should be to identify recurring performance problems and solve them permanently. Priority should be given to the most important (or most pervasive) performance problems.
To identify the most important performance problems, we categorize them by cause. We then total up the number of performance problems in each category. Figure 14 shows example totals. Identifying the most important problems in this manner is called a pareto analysis.
In this example, the most important performance problem we needed to solve was network related. At the time, we were using shared 10bt Ethernet networks. This problem was both critical and pervasive (as it affected multiple customers). This performance problem pareto analysis was a key part of the budget justification for network upgrades.
There may be a large percentage of performance problems for which a cause cannot be identified. In such cases, time should be spent trying to improve the type of data collected within the Slow tool so that number of "unknowns" can be decreased. Even so, successfully debugging more than 70% of reported performance problems is very difficult.
Weaknesses of Slow
Slow is good at finding performance problems caused by a lack of resources. Examples include a lack of RAM, lack of disk-IO capacity, lack of network bandwidth, and lack of CPU. Slow is less effective at identifying performance problems caused by bugs. For example, the IRIX window manager gradually gets slower over several weeks if the user never logs out, but Slow is unable to detect that sort of problem. To debug problems caused by bugs, more application-specific approaches are required (e.g., system call tracing, attaching with a debugger, enabling debug output).
Another weakness of Slow is that the machine must be usable to the point that Slow can collect data. If a workstation with a NFS-mounted /usr/local filesystem were to completely lose its network connection for 60 seconds, Slow would never detect the problem. This is because it wouldn't be able to execute at all until after the network connection is restored (assuming that /usr/local is in the user's PATH).
Slow is an effective tool for obtaining a shapshot of a workstation at the time a performance problem occurs. This provides both a timestamp and (in many cases) relevant data to enable analysis of the problem even after it has occurred. Slow can be easily modified to support additional operating systems and run additional site-specific debugging commands. By performing pareto analyses on Slow reports, the most important performance bottlenecks in the environment can be identified.
About the Author
Darrell Root has B.S. degrees in Mathematics and Computer Science. He works for Sterling Software at the NASA Ames Research Center. Darrell's specialties are performance analysis and large-scale fileserver engineering. He can be reached at email@example.com.
Belinda Liviero has an M.S. degree in Computer Science from Clemson University. She has worked at Intel Corporation for five years, where she currently manages the UNIX and NT client engineering teams for a large microprocessor design site. She can be reached at firstname.lastname@example.org.