A Methodology for Analyzing Network File System Performance Problems
In the multi-vendored, distributed environments that we've all become accustomed to, using NFS to distribute files and directories is as normal as knowing the root password when you are the system administrator. How else can you get any work done? Unfortunately, however, NFS has the appearance of being difficult to understand and tune. Do you look at the client or the server? How can nfsstat output be used to understand what NFS is doing? What other tools are available? In this article, I hope to provide a better understanding of how to tackle NFS-related performance problems.
Overview of NFS
The Network File System (NFS) was developed in the mid-1980s by Sun Microsystems. With quick acceptance in the UNIX community, NFS became the standard for how a distributed file system should look and work. NFS, version 2.0, is available on virtually all operating systems, including all UNIX flavors, PC DOS and Windows, and even MVS.
My discussion of NFS performance will be generic where possible. Nevertheless, each implementation of NFS has its own peculiarities, which I will try to address when possible. My current NFS experience is limited to Solaris 2, AIX 4, and HP-UX 9 and 10. I will note any NFS performance or configuration differences to be aware of between these systems.
How NFS Works
It is important to understand the flow of the I/O from the application both to and from the NFS-based file being accessed. Let's start from the application (see Figure 1). An application might be a user-written program or a UNIX utility such as ls or cp.
Somewhere in the application we access the file in question via a call to the system procedure read() or write(). In this system procedure, the vnode of the file points to appropriate execution procedures that can deal with the kind of filesystem that this file resides on. This might be a UFS file system or, in our case, an NFS file system.
If we are reading from or writing to an NFS file system, then the data request is sent to the client NFS daemon, called biod. The request is then transferred to the server NFS daemon, called nfsd, which resides on the NFS server. This daemon performs reads and writes to physical files based on the requests that biod has sent it.
The data or status is returned to the biod, which in turn returns the information to the system call procedures, which return the results to the actual application. The specific number of biod and nfsd daemons that run on your client and server systems is an important performance consideration to be discussed later.
NFS v2.0 implementations use the UDP protocol for communications. With a connectionless protocol such as UDP, the server maintains no state information on behalf of a specific client. In addition, the buffer sizes used by NFS v2.0 are limited to what UDP supports.
At the lowest level, the programming API for network communications is "socket programming" as sockets represent the addresses to which you want to communicate. At this level of programming, you access the raw data that was transmitted (in one or more packets) and massage it so that it is in a format you want. In addition, the application must deal with error conditions.
Remote Procedure Calls (RPCs) are an API layer above sockets. RPCs follow the procedure paradigm; you call a procedure passing it parameters that get loaded with data. In reality, this procedure is executed on the remote computer, and the data is returned and loaded into variables on your local computer. There is a different RPC procedure for each of the various NFS functions (e.g., getattr, lookup, read, etc.). The nfsstat command presents statistics of each of the NFS RPC calls.
Finding NFS Problems
Overall, there are three main blocks of functionality in the flow of the data that can cause performance problems. First, there is the client computer, which includes the application program and the biod daemons. Second, there is the network that supports the NFS requests and data that must move between the client and server. Finally, there is the server that processes the NFS requests. To locate the source of problems, it is often useful to start with the network. What appears to be an NFS problem often is really a network problem, caused either by some hardware malfunction or data overloads.
There are two basic sources of information about network problems. The first is hardware-dependent: the router. With an intelligent router and a program that can talk to an SNMP device, much information can be retrieved. The second source comes from the UNIX system via the netstat command (see Listing 1). Although netstat has many options, we are interested in the options -in. The "i" flag says we want interface data as opposed to socket connection data. The "n" flag tells netstat not to convert IP addresses to their human-readable names. This is important if indeed we are having some kind of network problem. If netstat cannot contact the name resolver computer (which translates the IP address to a name) due to network problems, netstat will just hang. Using the "n" option avoids that problem from the outset.
There are five statistics for each defined interface that will be displayed (on Solaris and HP-UX systems, there are 6). The numbers displayed are formatted versions of counters kept in the kernel. These counters constantly increment. To get meaningful information, you should enter a command such as netstat -in 10, where the "10" tells netstat to collect the interface statistics every 10 seconds and to "diff" the current data with the previous values. There should be no input or output errors. Any value greater than zero is indicative of a possible hardware problem and should be examined.
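The interval arithmetic that netstat -in 10 performs can also be done by hand on two saved samples. The following Python sketch is only an illustration: the field names (ipkts, ierrs, opkts, oerrs, coll) and the sample numbers are my assumptions, and expressing collisions as a percentage of output packets is a common convention rather than something netstat reports directly.

```python
# Hypothetical field names; real netstat -in column headings vary by vendor.
FIELDS = ("ipkts", "ierrs", "opkts", "oerrs", "coll")

def interval_delta(previous, current):
    """Return the change in each cumulative counter between two samples."""
    return {name: current[name] - previous[name] for name in FIELDS}

def check_interface(delta):
    """Return a list of warnings for one interval's worth of counters."""
    warnings = []
    if delta["ierrs"] > 0:
        warnings.append("input errors: possible hardware problem")
    if delta["oerrs"] > 0:
        warnings.append("output errors: possible hardware problem")
    return warnings

def collision_percent(delta):
    """Collisions as a percentage of output packets in the interval."""
    if delta["opkts"] == 0:
        return 0.0
    return 100.0 * delta["coll"] / delta["opkts"]

# Made-up samples taken 10 seconds apart.
before = {"ipkts": 1000, "ierrs": 0, "opkts": 2000, "oerrs": 0, "coll": 10}
after = {"ipkts": 1500, "ierrs": 2, "opkts": 2400, "oerrs": 0, "coll": 30}
delta = interval_delta(before, after)
```

With these made-up samples, the interval shows two input errors, which is enough to warrant a look at the hardware.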
Deciding if you have a problem with packet collisions is more difficult. The more collisions you have, the more retransmissions will be sent. You should be aware of the type of network connectivity you have (10-Base-T, FDDI, etc.) and its capacity. Are you near that capacity? Why are there so many packet transfers taking place? Is this normal? You may need to investigate to determine which machine is the originator of the network traffic; it may or may not be NFS activity, but network traffic will certainly affect the NFS throughput.
There are a number of components to the client system that may have an effect on NFS performance. The only real measure for determining whether there is a problem is if the application has a poor perceived response time. The problem could be the NFS server; however, it could also be the client. It might simply be a CPU or memory problem on the client system that isn't allowing the application to get enough CPU cycles.
I'll start by looking at system CPU activity. If the system is running at near 100% capacity, the biod daemons may not be getting enough priority to execute. Perhaps the application itself cannot get enough CPU cycles. You can get CPU stats via the command vmstat 10, where the 10 means to continue running, displaying the CPU statistics every 10 seconds. Examine the three right-most columns: "us", "sy", and "id". These numbers are percentages that the system is running in user mode, kernel (or system) mode, and idle. On AIX systems, there is also a fourth field, "wa", or "wait time". Another overall indicator of CPU starvation is the run queue length. This number represents the number of processes ready to run, only waiting on the CPU. You can get this metric via the w -u command. The three numbers seen to the right of "load average" represent the average run queue length for the last 1, 5, and 15 minutes. (Note: If your system is not running AIX, Solaris, or HP-UX, check your man pages for "w," because its interpretation may differ. Digital UNIX displays the run queue for the last 5, 30, and 60 seconds.)
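If you want to log the run queue length over time, the load average figures can be pulled out of the header line with a few lines of Python. This is a sketch only: the sample line is made up, and the exact layout of the line differs between vendors, so the pattern may need adjusting for your system.

```python
import re

def load_averages(line):
    """Return the three load-average figures from a w -u style header line."""
    match = re.search(r"load average:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line)
    if match is None:
        raise ValueError("no load average found in: " + line)
    return tuple(float(value) for value in match.groups())

# Made-up sample line; the exact layout differs between vendors.
sample = " 10:15am  up 3 days,  2:41,  5 users,  load average: 2.10, 3.50, 1.25"
one_minute, five_minutes, fifteen_minutes = load_averages(sample)
```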
Memory consumption can also cause NFS performance degradation. The quickest check on memory problems can be seen from the scan rate variable. This is the number of pages scanned by the clock algorithm. Ideally, it should be zero or very close to it. Once generic system resources have been eliminated, the biod processes should be examined. How many biod daemons should be running on your system?
On HP-UX systems, the default value is 4. In their white paper entitled "NFS Client/Server Configuration; Topology and Performance Tuning Guide", HP states that the recommended number of biod daemons should be between 12 and 16. Look in file /etc/rc.config.d/nfsconf for reconfiguring NFS for this value.
On AIX systems, the default number is 6. IBM states that you should have at least two times the number of files you are writing to simultaneously. So, if you have a total of five processes each writing to two NFS files, then you should have ten biod daemons. To change the number of biod daemons on an AIX system, use smit (Communications Applications and Services -> NFS -> Network File System -> Configure NFS on This System -> Change Number of nfsd & biod Daemons).
Solaris systems are easy to configure; they can't be configured! The biod functionality is implemented as a kernel thread on Solaris. As more requests are initiated, more threads are automatically created.
The question of what NFS is actually doing can be answered by the nfsstat command. Specifically, execute the command, nfsstat -rc, which gives RPC data for the client (see Listing 2).
The vast majority of the fields displayed are identical across all platforms (Sun's Solaris operating system adds a few goodies of its own). The "calls" field shows the total number of RPC calls made by this client. "badcalls" is the total number of calls that were rejected by the RPC layer. A nonzero "badcalls" count frequently implies that RPC calls on soft-mounted file systems are timing out. One possible reason, of course, is that the server has crashed. Alternatively, there may be some slowdown on the network or the server. In that case, you will want to increase the timeo or retrans parameter values via the mount command (see below), but there are other indicators we should look at first.
The field "badxid" is the number of times a reply was received from a server that did not correspond to any outstanding call by the client. How can this happen? Imagine the following scenario: The client sends an NFS request to the server. It receives no response, so it resends the request. Subsequently, a response comes back that satisfies the outstanding request, and the client is happy. Then the second response arrives (from the second request), but now, there is no outstanding request. This increments the "badxid" field. To fully determine what has happened, we must also look at the "timeout" field. This field displays the number of times a call timed out while waiting for a reply from the server. If the timeout value is significant (as a starting point, consider 5% of all calls), and "badxid" has a value that is approximately the same as timeout, then the server is getting the requests but returning them too slowly. However, if the timeout value is significant, but "badxid" is close to zero, then the server isn't getting the requests at all. This should give you enough information to proceed.
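The timeout and badxid reasoning above can be written down as a small decision function. This Python sketch uses the 5% starting point from the text; the two tolerances that define "approximately the same" and "close to zero" (near_equal and near_zero) are my own assumptions and should be adjusted to taste.

```python
def diagnose_client_rpc(calls, timeouts, badxid,
                        significant=0.05, near_equal=0.25, near_zero=0.10):
    """Classify client RPC statistics using the timeout/badxid rule of thumb.

    significant: fraction of calls that must time out before we worry.
    near_equal, near_zero: assumed tolerances, as fractions of timeouts.
    """
    if calls == 0 or timeouts / calls < significant:
        return "timeouts not significant"
    if abs(badxid - timeouts) <= near_equal * timeouts:
        return "server is receiving requests but responding too slowly"
    if badxid <= near_zero * timeouts:
        return "requests are not reaching the server"
    return "inconclusive"
```

For example, 800 timeouts out of 10,000 calls with a badxid count of 780 points at a slow server, while the same timeout count with a badxid count of 5 points at the network.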
Next, we'll look at the type of NFS transactions taking place. To do this, execute the command nfsstat -cn (Listing 3). The majority of data presented with this command are the counters and percentages of the different types of NFS calls executed (i.e., the distribution of NFS calls). To properly interpret the data, you must know what your client system is supposed to be doing. If most of the NFS activities are file reads and writes, you would expect the "read" and "write" fields to have the largest percentages. If, however, you see that most NFS activities are "lookup" or "create" or "getattr," then you may have either an application problem or a client caching problem.
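To make the distribution check concrete, here is a Python sketch that turns raw call counters into percentages and reports how much of the activity is name and attribute traffic. The counter values are made up, and treating "lookup", "create", and "getattr" together as one group is simply my reading of the rule of thumb above.

```python
def call_distribution(counts):
    """Return each call type as a percentage of all calls."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in counts.items()}

def metadata_share(counts, metadata_calls=("lookup", "create", "getattr")):
    """Percentage of all calls that are lookup, create, or getattr."""
    distribution = call_distribution(counts)
    return sum(distribution.get(name, 0.0) for name in metadata_calls)

# Made-up counters for a client that is supposed to be doing file I/O.
counts = {"read": 100, "write": 100, "lookup": 500, "getattr": 300}
share = metadata_share(counts)
```

Here most of the calls are lookups and getattrs rather than reads and writes, which would suggest looking at the application or the client cache settings.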
Most of the output fields from the nfsstat command are obvious, such as "mkdir", "readdir", or "statfs", but a few aren't. "lookup" returns a filehandle and file attributes for a filename in a directory. "getattr" returns file attribute information (such as creation timestamp, file size, etc.) for a specific filehandle. "readlink" returns the string in the symbolic link at the given filehandle. "symlink" creates a file with the given attributes, which is a symbolic link to the given pathname. For a better understanding of each field type, read RFC-1094, the NFS v2.0 specification. RFC-1813, the NFS v3.0 specification, actually provides better definitions of the fields, however.
Let's turn the discussion to the mount command, which is executed on the client computer, and its options. Two very important options are timeo and retrans. The timeo parameter specifies the NFS timeout time in tenths of a second. Defaults for this parameter are 20 for HP-UX, 11 for Solaris, and 7 for AIX. Based on the nfsstat -rc output above, this parameter may need to be increased.
retrans is the number of NFS retransmissions that should be attempted. Defaults for this parameter are 4 for HP-UX, 5 for Solaris, and 3 for AIX. The following example sets the timeout value to 5 seconds:
mount -r -o rsize=4096,wsize=4096,timeo=50 serv:/export/src \
Another important parameter is the rsize and wsize pair. These parameters define the read and write buffer sizes. By default, the values are 8192. Frequently, I have found problems using an NFS-mounted file system when my client computer is an Intel-based UNIX box. My system would hang or the data access/transfer would stop functioning. By mounting these file systems with an rsize and wsize of 4096, as above, all such problems completely disappeared. I have seen this behavior on both older Interactive Systems' UNIX and Solaris X86. This behavior may be due to underpowered network interface cards.
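Because timeo is given in tenths of a second, it is easy to be off by a factor of ten. The following Python sketch builds the -o option string from a timeout expressed in seconds. The function names are hypothetical helpers of my own; only the option names (rsize, wsize, timeo, retrans) come from the mount command.

```python
def timeo_from_seconds(seconds):
    """Convert seconds to the tenths-of-a-second unit used by timeo."""
    return round(seconds * 10)

def mount_options(timeout_seconds, rsize=8192, wsize=8192, retrans=None):
    """Build a comma-separated -o option string for an NFS mount."""
    options = [
        "rsize=%d" % rsize,
        "wsize=%d" % wsize,
        "timeo=%d" % timeo_from_seconds(timeout_seconds),
    ]
    if retrans is not None:
        options.append("retrans=%d" % retrans)
    return ",".join(options)

# A 5-second timeout with 4096-byte buffers.
example = mount_options(5, rsize=4096, wsize=4096)
```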
The last set of mount options I'll discuss are the attribute cache definition options. The attribute cache is a cache on each client system of the attributes of recently accessed directories and files. Attributes include such things as the create, modify and access times, file size, block size, etc. If entries expire from the attribute cache too quickly, many "getattr" calls will be seen. How long entries are kept in the attribute cache is configured with the following options (all values are in seconds):
actimeo - Absolute time a file entry is kept in cache after an update; if specified, this overrides the next two options.
acregmin - Minimum time a file entry is kept in cache after an update (default = 3).
acregmax - Maximum time a file entry is kept in cache after an update (default = 60).
acdirmin - Minimum time a directory entry is kept in cache after an update (default = 30).
acdirmax - Maximum time a directory entry is kept in cache after an update (default = 60).
Excessive "lookups" or "getattr" values in your nfsstat output may indicate that you need to adjust one or more of the above options when NFS-mounting the remote file systems. The idea is to minimize re-requesting attribute information from the NFS server as these requests take time to process. Additionally, pay attention to the applications that are running on the client. Are they supposed to be requesting file attribute information with such a high frequency?
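To see why a longer retention time reduces getattr traffic, consider this toy model in Python. It is a deliberate simplification and not how any vendor's client is implemented: every entry gets one fixed lifetime (much like setting actimeo), whereas a real client varies the lifetime between the minimum and maximum values. The class and function names are hypothetical.

```python
class AttributeCache:
    """Toy attribute cache with one fixed lifetime per entry."""

    def __init__(self, lifetime_seconds):
        self.lifetime = lifetime_seconds
        self.entries = {}        # filehandle -> (attributes, time fetched)
        self.getattr_calls = 0   # simulated requests sent to the server

    def get(self, filehandle, now, fetch):
        """Return cached attributes, asking the server only when expired."""
        cached = self.entries.get(filehandle)
        if cached is not None and now - cached[1] < self.lifetime:
            return cached[0]
        self.getattr_calls += 1
        attributes = fetch(filehandle)
        self.entries[filehandle] = (attributes, now)
        return attributes

def count_getattrs(lifetime_seconds, access_times):
    """Count simulated getattr requests for one file over a series of accesses."""
    cache = AttributeCache(lifetime_seconds)
    for now in access_times:
        cache.get("fh1", now, lambda filehandle: {"size": 0})
    return cache.getattr_calls

# One access per second for a minute.
short_lifetime = count_getattrs(3, range(60))
long_lifetime = count_getattrs(60, range(60))
```

In this model, a file that is touched once per second costs twenty server requests per minute with a 3-second lifetime and only one with a 60-second lifetime. The price of the longer lifetime is that the client may work with stale attributes for longer.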
Next, I will examine the components on the server system. The NFS software that runs on the server are the nfsd daemons, which also consume CPU, memory, and perform I/O (responding to the client requests).
I'll start by discussing the global CPU consumption. If the CPU is less than 20% idle, see about reducing the CPU load on the system. Identify process CPU consumption with the ps command. Alternatively, you may want to look for a process-oriented performance monitor such as top, monitor, or a commercial product. Identifying the source of disk I/O is more difficult. If your system's primary task is servicing NFS requests, make certain no other I/O-intensive tasks are being performed on this system. And if running other programs cannot be helped, then minimally try to move files and directories that are not NFS-related to disk drives other than those being used by the client systems. Identify disk I/Os by use of the iostat command. A typical command might be iostat -d 5 100 where the 5 is the repeat rate of 5 seconds, and the 100 means that you want one hundred iterations of these statistics. Refer to your man pages to interpret the output, as each vendor provides a different mix of statistics from this program.
Similar to determining the number of biod daemons, we also need to determine a good starting point for the number of nfsd daemons on the server computer. On HP-UX systems, the default number of daemons is 4. However, HP recommends that the starting value should be 32. AIX systems start with 8 nfsd daemons. They recommend, however, that the starting value should be set to the sum of all biod daemons on all clients accessing this server, plus a 20% slack. You can see how many processes are running by performing a ps -ef | grep nfsd (on SYSV-type systems).
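The AIX rule of thumb for nfsd daemons is simple arithmetic, sketched here in Python. The client biod counts are made up, and rounding the result up to a whole daemon is my assumption.

```python
def nfsd_starting_count(biods_per_client, slack_percent=20):
    """Sum of all client biod daemons plus slack, rounded up."""
    total = sum(biods_per_client)
    # Integer arithmetic avoids floating-point surprises when rounding up.
    return (total * (100 + slack_percent) + 99) // 100

# Four hypothetical clients running 6, 6, 4, and 4 biod daemons.
starting_value = nfsd_starting_count([6, 6, 4, 4])
```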
Sun, too, does not start the nfsd daemons with an adequate default value on its Solaris operating system. Note, however, that you will always see only one nfsd process when examining the system with the ps command. Most of the work done by nfsd is in the kernel as a set of kernel threads. There are 16 threads by default. Sun suggests starting with a minimum of 32 threads (threads are inexpensive). A ps -ef | grep nfsd on Solaris shows one command line like this: /usr/lib/nfs/nfsd -a 32.
To understand what NFS is doing on our server system, we turn again to the nfsstat command. We execute the command, nfsstat -sr, which presents server RPC data (see Listing 4). Each operating system (except AIX) displays a few additional RPC attributes. The common fields are "calls", "badcalls", "nullrecv", "badlen", and "xdrcall". Calls are the total number of RPC calls that have been received from all clients. Badcalls are the total number of calls that have been rejected by the RPC layer. This is the sum of "xdrcall" and "badlen". If this number is greater than zero, then requests are being rejected by the server. This could be caused by authentication problems resulting from having a user in too many groups (an NFS gotcha), or by attempts to access exported file systems as root when root has been excluded. "Badlen" is the count of RPC calls with a length shorter than the minimum-sized RPC call, and "xdrcall" is the count of RPC calls whose header could not be decoded by external data representation (XDR) translation.
The next level of analysis includes looking at the types of NFS transactions that are taking place. To do this, we execute the command nfsstat -sn (see Listing 5). The first field of interest is "null". This field is incremented when a client automounter pings the server. A large value indicates that automounter mounts of file systems are being repeatedly performed, which consumes network bandwidth. Increase the client timeout value on the automounter command line.
Examine the "getattr" field. As a starting point, if this number represents more than 30% of all the RPC calls, you should check the size of the attribute cache values on the NFS client machines. The attribute caches may not be properly configured.
The "lookup" field represents the number of directory lookups that were performed (either for a file or a directory entry). Assume, for example, that you want to open file /export/A/B/Filename where /export is the starting point of the mounted file system. There will be lookups for A, for B, and for Filename before the open is executed. Each lookup is an exchange of information between the NFS server and its client. If the percentage of the lookups represents the majority of the RPC activity, then the NFS server is spending most of its time doing directory reads - not necessarily a productive use of the server. In a similar manner, the field "readlink" represents the number of symbolic links that were returned to the clients. If this number is unusually large, then an analysis of the directory structure and symbolic links is in order.
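The lookup sequence for a path can be listed mechanically, as in this Python sketch. It ignores any name caching on the client, so it shows the worst case; the function name is a hypothetical helper.

```python
import posixpath

def lookups_for_path(path, mount_point):
    """Return the names looked up below the mount point, one per component."""
    relative = posixpath.relpath(path, mount_point)
    return [part for part in relative.split("/") if part not in ("", ".")]

names = lookups_for_path("/export/A/B/Filename", "/export")
```

Deep directory trees therefore multiply the lookup traffic: each extra directory level adds one more exchange with the server for every file opened.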
Since the main activity of the server system is to provide I/O services for client machines, it is important that the system be appropriately tuned for large numbers of I/Os. Here things become somewhat vendor-specific. For example, HP (HP-UX 9.x and 10.x) uses a dynamic buffer cache (DBC, a subset of real memory), which is dedicated to buffers for metadata and file contents. Therefore, the size of the DBC should be maxed out. Two parameters define the use of the DBC: dbc_min_pct and dbc_max_pct represent the minimum and maximum amount of memory to dedicate to the DBC. By default, dbc_min_pct is 5%. For a file server, this parameter should be set to about 80% or 85%, keeping in mind that you want to reserve a minimum of 32 Mb for the kernel. On HP-UX 10.10 or earlier, the maximum size of the DBC is 800 Mb. With HP-UX 10.20, the maximum size was extended to 3.75 Gb.
On AIX systems, you should set the maxperm parameter, which controls the maximum percentage of memory occupied by file pages, to 100%. This is done via the following command: vmtune -P 100.
On Solaris systems, the memory manager pages everything utilizing all available memory. Therefore, as long as you are not running non-NFS related programs that consume much memory, the operating system will ensure that memory is filled with the contents of files that the nfsd daemons are accessing.
Another kernel tunable is the size of the inode table. This table maintains a cache of inodes (active and/or inactive, meaning currently open files or recently closed files). If the inode cache is full, then new files cannot be opened. An HP-UX rule of thumb for ninode (the kernel variable name on HP-UX) is to allocate 1-2 inodes per 8 Kb of memory in the DBC. On Solaris 2.4 and later, the kernel variable ufs_ninode only represents the number of inactive inodes (recently closed files). There is no active inode limitation in Solaris. With Solaris 2.3 or earlier, ufs_ninode represents the total number of inodes, active and inactive. For NFS servers, set ufs_ninode to one half of the size of the directory name lookup cache (DNLC), which is the kernel variable ncsize. On AIX, the kernel variable is nhino.
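The two sizing rules of thumb above reduce to simple arithmetic, sketched here in Python. The 64 Mb DBC size and the ncsize value are made-up inputs, and the function names are hypothetical.

```python
def hpux_ninode_range(dbc_bytes):
    """Return (low, high) ninode values: 1 to 2 inodes per 8 Kb of DBC."""
    units = dbc_bytes // (8 * 1024)
    return units, 2 * units

def solaris_ufs_ninode(ncsize):
    """Return half the DNLC size, the suggested ufs_ninode for NFS servers."""
    return ncsize // 2

low, high = hpux_ninode_range(64 * 1024 * 1024)   # a 64 Mb DBC
suggested = solaris_ufs_ninode(8000)              # a made-up ncsize
```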
One point should be made about the actual file systems that the server has exported. In many shops, a dedicated file system is used to store utilities such as the files that might be stored in the directory /usr/local. Since this directory is only accessed for reading or executing, you should mount it on the server, read-only. Then, the client system similarly mounts the NFS file system using the -r option (read-only). The advantage is that the inodes for the files that are accessed do not need to be flushed (to update the access time).
The design of the client environment and the programs that are running on that system can have a significant effect not only on the client but also on the server system. An example of a poorly designed client environment might be seen in a development shop. Imagine that your home directory is on an NFS server running AIX 4.1. In a subdirectory, you create a build environment; your sources are in one of the subdirectories (on the server), and the object files are also on the server. You log into a Solaris box as needed to build a Solaris binary of the application. During your build, the compiler reads the ".c" files from the server and writes the ".o" files back to the server. The ".c" files can certainly be on the server. But do you need to globally save the ".o" files, too? Store them locally on another file system. I have seen a 30-minute compile in this environment be reduced to a 10-minute compile when the ".o" files were stored on the local client system.
A second problem is related to the PATH variable. Suppose your PATH setting looks like this:
PATH=/usr/bin:/usr/local/bin:/export/local/bin:.
where /usr/local/bin and /export/local/bin are NFS file systems. If you execute the command "A", which is a program in your local home directory, the loader starts by looking for A in /usr/bin, then /usr/local/bin, then /export/local/bin, and finally your home directory, ".". When looking in /usr/local/bin, note that NFS has first to do a lookup for local, then for bin and finally for A before moving on to the next directory in the PATH list. Reduce resource consumption by carefully examining PATH. Should the "." be placed first? Perhaps all NFS file systems that might be referenced should be placed last.
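One way to act on this advice is to reorder PATH so that NFS directories are searched last. The Python sketch below uses the search order described above (/usr/bin, /usr/local/bin, /export/local/bin, then "."). You must supply the list of NFS-mounted prefixes yourself, and the function name is a hypothetical helper.

```python
def nfs_last(path_value, nfs_prefixes):
    """Reorder a PATH string so NFS-backed directories are searched last."""
    def is_nfs(directory):
        return any(
            directory == prefix or directory.startswith(prefix.rstrip("/") + "/")
            for prefix in nfs_prefixes
        )

    directories = path_value.split(":")
    local = [d for d in directories if not is_nfs(d)]
    remote = [d for d in directories if is_nfs(d)]
    return ":".join(local + remote)

reordered = nfs_last(
    "/usr/bin:/usr/local/bin:/export/local/bin:.",
    ["/usr/local/bin", "/export/local/bin"],
)
```

The relative order within the local group and within the NFS group is preserved, so only the position of the NFS directories changes.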
To understand the load that the application is creating on the NFS server, run nfsstat -c (all client output) before and after the application runs. To get valid statistics, be sure no other client programs are accessing NFS. The numbers will tell you the NFS load per program or per transaction. Another option is to take advantage of the system tracing mechanism. On AIX systems, a process-level or system-level trace mechanism can be selectively turned on (refer to trace, trcon, trcrpt). Each NFS RPC call can be individually observed and logged. The Solaris truss facility does not go into the kernel, where the biod functionality is located.
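The before-and-after comparison is a subtraction per counter, optionally divided by the number of transactions the application performed. This Python sketch assumes you have already copied the counters from the two nfsstat -c runs into dictionaries; the numbers shown are made up.

```python
def nfs_load(before, after):
    """Return the number of calls of each type made between two samples."""
    return {name: after[name] - before.get(name, 0) for name in after}

def per_transaction(load, transactions):
    """Return the average number of calls of each type per transaction."""
    return {name: count / transactions for name, count in load.items()}

# Made-up counters sampled before and after a run of 10 transactions.
before = {"getattr": 100, "lookup": 50, "read": 400, "write": 10}
after = {"getattr": 160, "lookup": 80, "read": 1400, "write": 10}
load = nfs_load(before, after)
average = per_transaction(load, 10)
```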
The latest version of NFS developed by Sun Microsystems is version 3.0, which has been available on Sun Solaris systems for about one year. Both IBM and HP have released their own implementations of version 3 of NFS. It is not clear, however, how many sites are actually using the new version (see sidebar "Futures and Resources").
One of the major enhancements associated with NFS 3.0 is increased performance. One source for this performance boost is that you can now choose TCP as the NFS protocol instead of UDP. There are no limitations on the maximum buffer size for the NFS transfers with the TCP protocol, allowing the software to take advantage of the high bandwidth network technologies such as FDDI.
NFS v3.0 also offers something called "safe asynchronous writes." With safe asynchronous writes, client applications no longer have to wait for individual write requests to complete as they did with synchronous writes under today's NFS v2.0. Instead, multiple write requests are collected and then written to the server at one time.
Additional NFS v3.0 features include such things as high availability (by replicating the NFS server), additional security by use of Kerberos for authentication, access control lists (ACLs) for authorization, and DES-based encryption. Converting from NFS version 2 to version 3 can be painless as the version 3 NFS server software supports clients using both the new and older versions of NFS. This switching back and forth is accomplished through the RPC protocol, used by NFS, which supports versioning of a service (the version number is part of the RPC call).
When tuning a network using NFS, remember the data flow from the application on the client computer through the network to the server and back to the application. Impediments to the flow of data can occur anywhere along that path. By looking at each of these subsystems along the way with the tools described in this article, you should be able to determine which subsystem is causing the slow response time back to the application.
About the Author
Joe Berry is a senior software developer at Landmark Systems Corporation in Vienna, Virginia. He is one of the authors of Landmark's PerformanceWorks products, PerformanceWorks/Smart Agents for UNIX. A former systems specialist for Hewlett-Packard, he has been in the computer industry for almost 30 years.