Background
When I started writing code for a new USB filesystem, I was told that I should
implement asynchronous I/O by providing aio_read and aio_write file
operations. I had assumed that these file operations corresponded to the
userland aio_read(3) and aio_write(3) calls.
I started to write the code for my kernel driver and ran across a couple of
infrastructure questions. I wanted to know what the system call code was doing
before my driver functions were called. My first reaction was to grep for
sys_aio_read to find the kernel side of the system call. No such function
exists. My assumption that aio_read(3) and aio_write(3) were system calls was
wrong. This led to several questions:
- What do the userspace aio_read(3) and aio_write(3) functions actually do?
- How are the kernel driver aio_read and aio_write file operations called?
GNU libc
It turns out that GNU libc implements aio_read(3) and aio_write(3). These
functions are actually a userland implementation of asynchronous I/O. When
aio_read(3) or aio_write(3) is first called on an open file descriptor, libc
creates a new thread and adds the read or write request to that thread's
request queue. Subsequent requests also go into the request queue.
Higher-priority requests are processed first; requests of equal priority are
handled in the order they were submitted.
The newly created threads simply call blocking read(2) or write(2). This
means that kernel-side asynchronous I/O is bypassed: if a character device
declares an aio_read or aio_write file operation, those file operations are
never called by the glibc implementation.
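For reference, here is a minimal sketch of what a program using the glibc interface looks like. The file name is only an example; the point is that the request below is serviced by a glibc helper thread making a blocking read.

```c
/* Minimal sketch of the glibc (POSIX) AIO interface.  Despite the
 * asynchronous API, glibc services this request from a helper thread
 * that simply issues a blocking read().  Compile with -lrt. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    struct aiocb cb;
    int fd = open("/etc/hostname", O_RDONLY);    /* any readable file */

    if (fd < 0)
        return 1;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0)                       /* queue the request */
        return 1;

    while (aio_error(&cb) == EINPROGRESS)        /* poll until done */
        usleep(1000);

    printf("read %zd bytes\n", aio_return(&cb)); /* collect the result */
    close(fd);
    return 0;
}
```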
libaio
To access kernel-side asynchronous I/O, userspace programs must use libaio. This library provides wrappers for the Linux-specific asynchronous I/O system calls:
- io_setup
- io_submit
- io_destroy
- io_cancel
- io_getevents
These system calls, implemented in fs/aio.c, provide the kernel-side
asynchronous I/O support; they are what actually call a driver's aio_read and
aio_write file operations. The system calls needed unique names so they would
not conflict with the GNU libc aio_read(3) and aio_write(3) functions.
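A sketch of how a program drives this interface through libaio might look like the following. The file name is an invented example, and, as discussed below, the request may still complete synchronously inside io_submit() unless O_DIRECT is used.

```c
/* Minimal sketch of the libaio interface, which wraps the io_*
 * system calls from fs/aio.c.  Compile with -laio. */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event events[1];
    static char buf[4096];
    int fd = open("testfile", O_RDONLY);

    if (fd < 0 || io_setup(1, &ctx) < 0)             /* create the AIO context */
        return 1;

    io_prep_pread(&cb, fd, buf, sizeof(buf), 0);     /* describe the request */
    if (io_submit(ctx, 1, cbs) != 1)                 /* hand it to the kernel */
        return 1;

    if (io_getevents(ctx, 1, 1, events, NULL) == 1)  /* wait for completion */
        printf("read returned %ld\n", (long)events[0].res);

    io_destroy(ctx);
    close(fd);
    return 0;
}
```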
The kernel-side aio implementation is meant to be truly asynchronous. The
driver's aio_read() and aio_write() are expected to return immediately after
they queue the request. When an interrupt in the driver signals that the
transfer is complete, the driver calls kick_iocb(). This guarantees that the
driver's read_retry() function is called within the context of the process
that made the initial system call. The driver then copies the data into
userspace and signals to the aio subsystem that the transaction is complete by
calling aio_complete().
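To make that flow concrete, here is a rough, non-authoritative sketch of the driver side, loosely based on the 2.6.20-era iovec-style aio_read prototype. my_dev, my_hw_start_read(), and my_transfer_done() are invented names; a real driver such as gadgetfs also supplies a ki_retry method and calls kick_iocb() when data has to be copied to userspace outside interrupt context.

```c
/* Rough sketch only: the exact prototypes varied between kernel versions. */
#include <linux/aio.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/uio.h>

struct my_dev;					/* hypothetical device state */
extern int my_hw_start_read(struct my_dev *dev, struct kiocb *iocb,
			    const struct iovec *iov, unsigned long nr_segs);

static ssize_t my_aio_read(struct kiocb *iocb, const struct iovec *iov,
			   unsigned long nr_segs, loff_t pos)
{
	struct my_dev *dev = iocb->ki_filp->private_data;
	int ret;

	/* Start the hardware transfer and return immediately;
	 * -EIOCBQUEUED tells fs/aio.c the request is in flight. */
	ret = my_hw_start_read(dev, iocb, iov, nr_segs);
	if (ret < 0)
		return ret;
	return -EIOCBQUEUED;
}

/* Called at completion time: report the byte count to the aio core. */
static void my_transfer_done(struct kiocb *iocb, ssize_t nbytes)
{
	aio_complete(iocb, nbytes, 0);
}

static const struct file_operations my_fops = {
	.aio_read = my_aio_read,
	/* .aio_write would follow the same pattern */
};
```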
Truly asynchronous behavior
One would hope that doing asynchronous I/O in the kernel would be more
efficient. There has been some discussion about the subject; however, the
discussion is irrelevant unless kernel drivers structure their aio_read and
aio_write functions properly.
When I was searching for an asynchronous driver to use as an example, I found
tons of aio_read and aio_write file operations that weren't asynchronous. They
would simply initiate their transaction and then wait for it to finish, like a
simple read file operation.
A truly asynchronous implementation of the aio_read and aio_write file
operations would call kick_iocb() or aio_complete() somewhere in the driver.
By doing grep -r 'kick_iocb\|aio_complete' in the 2.6.20-rc4 kernel source
tree, I came up with the following files:
- drivers/usb/gadget/inode.c - the gadgetfs filesystem, used on USB peripheral devices that run Linux.
- fs/aio.c - the file that handles the aio syscalls and calls down into the driver's file operations table.
- fs/block_dev.c - block device driver.
- fs/direct-io.c - the generic O_DIRECT code used by block-based filesystems.
- fs/nfs/direct.c - O_DIRECT support for NFS.
In the case of NFS and block devices, you only get asynchronous behavior if the file descriptor has the O_DIRECT flag set. This asks the kernel to attempt to bypass the cache and write into or read from the userspace buffer directly. As the man page for open says, "In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching." Bottom line: the only useful asynchronous I/O implementation in the kernel is in gadgetfs.
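For completeness, here is a small hedged sketch of what that O_DIRECT setup might look like; open_for_kernel_aio() is an invented helper, and the 512-byte alignment is only a common minimum, not a guarantee for every device or filesystem.

```c
/* Sketch: asking for kernel-side asynchrony on a block device or NFS
 * file.  O_DIRECT bypasses the page cache and requires the buffer
 * (and usually the offset and length) to be suitably aligned. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int open_for_kernel_aio(const char *path, void **buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECT);

    /* An aligned buffer suitable for io_prep_pread() above. */
    if (fd >= 0 && posix_memalign(buf, 512, len) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```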
Fibrils
Another way of dealing with asynchronous I/O in the kernel is being discussed on the linux-kernel mailing list. The idea is to create a thread with a tiny stack whenever an I/O operation blocks. This light-weight thread is called a fibril. When a fibril blocks, the scheduler will be called, and there is a chance that the userland program that made the syscall will be scheduled. It can continue with other operations while it waits for the I/O operation to complete.
In some ways, this sounds suspiciously similar to GNU libc's userland aio implementation. There may be some performance gain from creating threads in the kernel rather than in userspace, but the kernel developers are still deciding on the details of fibrils.
The benefit of fibrils is that device driver writers can write simple blocking code and the kernel will turn it into asynchronous I/O. If the fibril approach really catches on, I think that libaio and the kernel aio system calls will fall out of use.
Conclusion
Few people truly understand asynchronous I/O, and documentation on the subject
is sparse. The beginning userspace application writer will probably use the
GNU libc aio_read(3) and aio_write(3) functions without realizing that they
are not using the kernel-side asynchronous I/O implementation.
The few kernel drivers that implement aio_read and aio_write file operations
in a truly asynchronous manner are rarely used and may not be well tested.
Until a true performance gain is shown when kernel aio is used instead of GNU
libc's implementation, I would suggest that kernel driver writers write
blocking code. The implementation is simpler, and userspace applications that
want asynchronous calls can use GNU libc.