jun2002.tar

System Calls and the I/O Path

Henry Newman

If you read my last column, you might ask why would a systems administrator need to know about the I/O path for the C library, much less the path for UNIX system calls? During my two+ decades in the business, some of the biggest mistakes I've seen sys admins and vendors make is to optimize I/O without fully understanding how it all works together. You cannot configure and optimize storage systems (hardware and software) without fully understanding the file sizes, number of files, and most importantly, the I/O sizes that the file system will see from the applications. This month, I will be looking at system calls and how they send data to the file system. Future columns will cover:

File system structure and the history
File system performance and configuration
RAID systems configuration and performance
Secondary storage and library performance
Backup and HSM performance issues
Development of a storage benchmark based on your system

Each of these areas might take a few months to cover, so this outline will last for a while. I am interested in your feedback, so please let me know if you have any suggestions.

I/O Path (Using UNIX System Calls)

My last column covered the I/O path using the C library package. I'll now discuss the details of using direct UNIX system calls for doing I/O, and I'll also discuss the path the data follows if requests do not begin and end on 512-byte boundaries. The system path for both system call and C library I/O is the same in this case.

When a program uses system calls for I/O, a large number of variables affect both the path and the performance. I will begin with a description of some important system calls and how they are used. There are two types of I/O supported on most UNIX systems:

Synchronous I/O -- Each I/O request waits for the completion of the last request to that file descriptor. With synchronous I/O, you will wait for the I/O request before execution of the next instruction in the application.
Asynchronous I/O (AIO) -- I/O requests are sent to the system and then the synchronization is requested by the application. With asynchronous I/O the requests are issued and the next instruction is immediately executed.

System Call for Synchronous I/O

The following is a list of common system calls and their meaning:

open -- Opens a file descriptor. Important options for some systems including systems with:

a. Large files over 2 GB

b. Synchronous I/O for data integrity

c. Reading/writing

d. Direct I/O -- I/O that moves directly from user space to the device without using any system caching.

lseek -- Set the file descriptor to the byte position specified. Some systems require the use of lseek64 for files larger than 2 GB.

read -- Reads data from a file descriptor into the user data area (buffer); allows the application to check for errors.

pread -- Reads data from a file descriptor into the user data area (buffer) from a specific location in the file. This is the equivalent of a read and lseek, but does not set the file pointer to the position as the lseek system call does; allows the application to check for errors.

System Call for Asynchronous I/O

POSIX standards for real-time systems require support for asynchronous I/O (POSIX.4). The concept of asynchronous I/O (at least from what I can determine) came from the benchmarking group at Control Data Corporation back in the late 1960s. They needed a way to read data from a FORTRAN program while still being able to execute the program because the disks were far slower than the CPU. (Not much has changed since then, has it?). Today, most operating systems support asynchronous I/O (AIO) system calls in the library libaio and in the operating system.

aio_read -- Asynchronous read request; allows the application to check for errors.

aio_write -- Asynchronous write request; allows the application to check for errors.

lio_listio -- A special call that allows you to issue a list of read or writes with a single system call. This is very useful when reading or writing a number of records in a file at the same time; it allows the application to check for errors. (I think Larry Schermer, from Cray Research, originally created the list I/O call in the 1980s during Cray's transition to UNIX.)

How Does It All Work?

The most important thing to remember about the basic unit for the storage hardware is that all physical requests must start on 512-block boundaries and end on 512-block boundaries. Figure 1 shows an example.

The first request begins on a 512-byte boundary and ends on a 512-byte boundary, while in the second example (in Figure 1) the I/O request does not begin and end on 512-byte boundaries. So what happens in the system when you do not make I/O requests on 512 boundaries? What actually happens is that the system must convert the requests to 512-byte boundaries, because that is a physical hardware limitation and the overhead is extremely high.

What the System Does

There are a large number of N-cases that I will try to cover over the next few months, so I'll start with the simplest case and work to the most complex. Much of this information will be covered in more detail in future columns because it relates to file system implementations and issues with direct I/O. Table 1 shows what happens in the most simplistic cases.

Applications Programming Issues

As you can clearly see, making requests that are not on 512-byte boundaries can cause serious performance problems for the operating system. There are a few important dos and don'ts to ensure system performance for programs that perform I/O through either the system calls discussed previously, or the C Library package discussed in my column last month. Table 2 summarizes I/O types and the tradeoff between the two I/O methods.

Sequential I/O

In general, changing the request size for a program that does sequential I/O using the C library package is easier than changing programs that make direct system calls. All the user must do is add a single line after the open to call the setvbuf(3) function. For programs that often make system calls, this requires major rewrites to the program and data restructuring in order to make larger requests. This often has implications on the program's computational structure.

Random I/O

Random I/O is when a program does not access a file reading or writing from the start of the file to the end. Sometime programs that do random I/O really skip randomly within the file and other times they do read a file backwards. Programs that do random I/O are often a rather difficult way to improve the I/O performance, and the solution is rarely very simple. I have occasionally seen that the file being opened is small by today's standards and can actually fit in memory. I had one program that I worked on a few years ago that used a 40-MB random I/O file using the C library package opening and closing it thousands of times so it was never kept in memory. A simple change I made was to set the buffer size to the size of the file and remove the opens and closes. Thus, I dramatically improved wall-time performance and CPU performance.

The issues with random I/O are a bit more complex. You never want to make requests bigger than the physical request from the application. If the requests begin and end on 512-byte boundaries, then system calls are your best choice. If they do not begin and end on 512-byte boundaries, then using the C library and setting the buffer to the request size rounded to the next 512-byte boundaries is a better choice, given the read-modify-write that will be required for the system calls.

It is important to note that random I/O is not always as random as you think. Your applications often do what I call "randomly sequential I/O". A number of requests are made sequentially, a seek request is made, and I/O are then done sequentially again. From what I have seen from traces, this is very common in databases, search engines, and a number of scientific applications. It is more difficult to improve the performance of programs that do random I/O than programs that do sequential I/O.

Conclusions

It should be noted that for RAID devices the basic block size is not 512 bytes, so you can have another level of inefficiency. This will be covered in a later column. My next two columns will cover file system tuning, configuration, and performance as we move down the layers of the hardware and software to the disk.

Henry Newman has worked in the IT industry for more than 20 years. Originally at Cray Research and now with a consulting organization, he has provided expertise in systems architecture and performance analysis to customers in government, scientific research, and industry around the world. His focus is high-performance computing, storage and networking for UNIX systems, and he previously authored a monthly column about storage for Server/Workstation Expert magazine. He may be reached at: hsn@hsnewman.com.