System Calls and the I/O Path
Henry Newman
If you read my last column, you might ask why a systems
administrator would need to know about the I/O path for the C library,
much less the path for UNIX system calls. During my two-plus decades
in the business, some of the biggest mistakes I've seen sys
admins and vendors make have been optimizing I/O without fully understanding
how it all works together. You cannot configure and optimize storage
systems (hardware and software) without fully understanding the
file sizes, number of files, and most importantly, the I/O sizes
that the file system will see from the applications. This month,
I will be looking at system calls and how they send data to the
file system. Future columns will cover:
- File system structure and the history
- File system performance and configuration
- RAID systems configuration and performance
- Secondary storage and library performance
- Backup and HSM performance issues
- Development of a storage benchmark based on your system
Each of these areas might take a few months to cover, so this
outline will last for a while. I am interested in your feedback,
so please let me know if you have any suggestions.
I/O Path (Using UNIX System Calls)
My last column covered the I/O path using the C library package.
I'll now discuss the details of using direct UNIX system calls
for doing I/O, and I'll also discuss the path the data follows
if requests do not begin and end on 512-byte boundaries. The system
path for both system call and C library I/O is the same in this
case.
When a program uses system calls for I/O, a large number of variables
affect both the path and the performance. I will begin with a description
of some important system calls and how they are used. There are
two types of I/O supported on most UNIX systems:
- Synchronous I/O -- Each I/O request waits for the completion
of the last request to that file descriptor. With synchronous
I/O, the application waits for each I/O request to complete before
executing the next instruction.
- Asynchronous I/O (AIO) -- I/O requests are sent to the
system, and synchronization is requested later by the application.
With asynchronous I/O, the request is issued and the next instruction
is executed immediately.
System Calls for Synchronous I/O
The following is a list of common system calls and their meanings
(a usage sketch follows the list):
open -- Opens a file descriptor. Important options are available
on some systems for:
a. Large files over 2 GB
b. Synchronous I/O for data integrity
c. Reading/writing
d. Direct I/O -- I/O that moves directly from user space to
the device without using any system caching.
lseek -- Set the file descriptor to the byte position
specified. Some systems require the use of lseek64 for files
larger than 2 GB.
read -- Reads data from a file descriptor into the
user data area (buffer); allows the application to check for errors.
pread -- Reads data from a file descriptor into the
user data area (buffer) from a specific location in the file. This
is the equivalent of an lseek followed by a read, but
it does not move the file pointer the way the lseek system
call does; allows the application to check for errors.
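Here is a minimal sketch of how these calls fit together; the file
name "datafile", the offsets, and the buffer sizes are made up, and
error handling is abbreviated. On some systems, files over 2 GB would
also require open64()/lseek64() or the O_LARGEFILE open flag.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;

    int fd = open("datafile", O_RDONLY);   /* "datafile" is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    /* lseek: set the file pointer 8192 bytes into the file, then read. */
    if (lseek(fd, (off_t)8192, SEEK_SET) == (off_t)-1) {
        perror("lseek");
        return 1;
    }
    n = read(fd, buf, sizeof(buf));
    if (n < 0) perror("read");

    /* pread: read from byte 0 without moving the file pointer. */
    n = pread(fd, buf, sizeof(buf), (off_t)0);
    if (n < 0) perror("pread");

    close(fd);
    return 0;
}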
System Calls for Asynchronous I/O
POSIX standards for real-time systems require support for asynchronous
I/O (POSIX.4). The concept of asynchronous I/O (at least from what
I can determine) came from the benchmarking group at Control Data
Corporation back in the late 1960s. They needed a way to read data
from a FORTRAN program while still being able to execute the program
because the disks were far slower than the CPU. (Not much has changed
since then, has it?). Today, most operating systems support asynchronous
I/O (AIO) system calls in the library libaio and in the operating
system.
aio_read -- Asynchronous read request; allows the application
to check for errors.
aio_write -- Asynchronous write request; allows the
application to check for errors.
lio_listio -- A special call that allows you to issue
a list of reads or writes with a single system call. This is very
useful when reading or writing a number of records in a file at
the same time; it allows the application to check for errors. (I
think Larry Schermer, from Cray Research, originally created the
list I/O call in the 1980s during Cray's transition to UNIX.)
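Here is a minimal sketch of the AIO calls just described; the file
name and request size are made up, error handling is abbreviated,
and on many systems the program must be linked with -lrt.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[8192];
    struct aiocb cb;

    int fd = open("datafile", O_RDONLY);   /* "datafile" is illustrative */
    if (fd < 0) { perror("open"); return 1; }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;           /* file descriptor to read from */
    cb.aio_buf    = buf;          /* user buffer for the data     */
    cb.aio_nbytes = sizeof(buf);  /* request size                 */
    cb.aio_offset = 0;            /* AIO reads at a given offset  */

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

    /* The read is now in flight; the program can do useful work here. */

    /* Synchronize: poll for completion (aio_suspend() would block instead). */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    ssize_t n = aio_return(&cb);  /* completion status, like read()'s return */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}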
How Does It All Work?
The most important thing to remember is that the basic unit for
the storage hardware is 512 bytes: all physical requests must start
and end on 512-byte boundaries. Figure 1 shows an example.
The first request begins and ends on 512-byte boundaries, while
the second request (in Figure 1) does not. So what happens in the
system when you do not make I/O requests on 512-byte boundaries?
The system must convert the requests to 512-byte boundaries, because
the boundaries are a physical hardware limitation, and the overhead
of that conversion is extremely high.
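To make the conversion concrete, here is a hypothetical sketch of
the rounding arithmetic involved; the 700-byte offset and 1000-byte
length are made up, and this illustrates the principle, not any
particular kernel's implementation.

#include <stdio.h>

#define SECTOR 512L

int main(void)
{
    /* An unaligned request: 1000 bytes starting at byte offset 700. */
    long offset = 700, length = 1000;

    /* Round the start down and the end up to sector boundaries. */
    long start = offset & ~(SECTOR - 1);                          /* 512  */
    long end   = (offset + length + SECTOR - 1) & ~(SECTOR - 1);  /* 2048 */

    /* For a write, the partial first and last sectors must first be
       read, merged with the user's 1000 bytes, and written back out:
       the read-modify-write the system is forced to perform. */
    printf("physical request: bytes %ld-%ld (%ld sectors)\n",
           start, end - 1, (end - start) / SECTOR);
    return 0;
}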
What the System Does
There are a large number of cases that I will try to cover over
the next few months, so I'll start with the simplest case and
work to the most complex. Much of this information will be covered
in more detail in future columns because it relates to file system
implementations and issues with direct I/O. Table 1 shows what happens
in the most simplistic cases.
Applications Programming Issues
As you can clearly see, making requests that are not on 512-byte
boundaries can cause serious performance problems for the operating
system. There are a few important dos and don'ts for ensuring
good performance in programs that do I/O through either the system
calls discussed previously or the C library package discussed in
my column last month. Table 2 summarizes the I/O types and the
tradeoffs between the two I/O methods.
Sequential I/O
In general, changing the request size for a program that does
sequential I/O using the C library package is easier than changing
a program that makes direct system calls. All the user must do is
add a single line after the open to call the setvbuf(3) function
(a sketch follows below). For programs that make system calls directly,
making larger requests often demands major rewrites and data
restructuring, which in turn frequently affects the program's
computational structure.
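As a sketch, that single added line might look like this; the 1-MB
buffer size and the file name "datafile" are illustrative, and
setvbuf(3) must be called after the fopen(3) but before any other
operation on the stream.

#include <stdio.h>
#include <stdlib.h>

#define BIGBUF (1024 * 1024)   /* illustrative 1-MB buffer */

int main(void)
{
    FILE *fp = fopen("datafile", "r");   /* "datafile" is illustrative */
    if (fp == NULL) { perror("fopen"); return 1; }

    /* The one added line: give stdio a fully buffered 1-MB buffer, so
       sequential fread() calls hit the file system 1 MB at a time.
       Passing NULL lets the library allocate the buffer itself. */
    if (setvbuf(fp, NULL, _IOFBF, BIGBUF) != 0)
        fprintf(stderr, "setvbuf failed\n");

    /* ... the program's existing small sequential reads go here ... */
    fclose(fp);
    return 0;
}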
Random I/O
Random I/O is when a program does not read or write a file
sequentially from its start to its end. Sometimes programs that do
random I/O really do skip randomly within the file, and at other
times they read the file backwards. Improving I/O performance for
programs that do random I/O is often rather difficult, and the
solution is rarely simple. I have occasionally seen that the file
being opened is small by today's standards and can actually fit
in memory. One program I worked on a few years ago did random I/O
through the C library package on a 40-MB file, opening and closing
it thousands of times, so the file was never kept in memory. A
simple change, setting the buffer size to the size of the file and
removing the opens and closes, dramatically improved both wall-time
and CPU performance.
The issues with random I/O are a bit more complex. You never want
the system to make requests bigger than the physical request from
the application. If the requests begin and end on 512-byte boundaries,
then system calls are your best choice. If they do not begin and
end on 512-byte boundaries, then using the C library and setting
the buffer to the request size, rounded up to the next 512-byte
boundary, is the better choice, given the read-modify-write that
the system calls would otherwise require.
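Here is a minimal sketch of that buffer rounding; the 1000-byte
request size and the file name are made up, and stdio is simply
asked to allocate the rounded buffer itself.

#include <stdio.h>

#define SECTOR 512

int main(void)
{
    size_t request = 1000;   /* the application's record size (illustrative) */
    size_t bufsz   = (request + SECTOR - 1) & ~(size_t)(SECTOR - 1); /* 1024 */

    FILE *fp = fopen("datafile", "r+");  /* "datafile" is illustrative */
    if (fp == NULL) { perror("fopen"); return 1; }

    /* Buffer rounded up to a 512-byte multiple, so the library's
       physical requests land on sector boundaries. */
    setvbuf(fp, NULL, _IOFBF, bufsz);

    /* ... fseek()/fread() random requests go here ... */
    fclose(fp);
    return 0;
}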
It is important to note that random I/O is not always as random
as you think. Applications often do what I call "randomly
sequential I/O": a number of requests are made sequentially,
a seek request is made, and I/O is then done sequentially again.
From the traces I have seen, this pattern is very common in databases,
search engines, and a number of scientific applications. Even so,
it remains more difficult to improve the performance of programs
that do random I/O than of programs that do sequential I/O.
Conclusions
It should be noted that for RAID devices the basic block size
is not 512 bytes, so you can have another level of inefficiency.
This will be covered in a later column. My next two columns will
cover file system tuning, configuration, and performance as we move
down the layers of the hardware and software to the disk.
Henry Newman has worked in the IT industry for more than 20
years. Originally at Cray Research and now with a consulting organization,
he has provided expertise in systems architecture and performance
analysis to customers in government, scientific research, and industry
around the world. His focus is high-performance computing, storage
and networking for UNIX systems, and he previously authored a monthly
column about storage for Server/Workstation Expert magazine.
He may be reached at: hsn@hsnewman.com.