Background

When I started writing code for a new USB filesystem, I was told that I should implement asynchronous I/O by providing aio_read and aio_write file operations. I had assumed that these file operations corresponded to the userland aio_read(3) and aio_write(3) calls.

I started writing the code for my kernel driver and ran across a couple of infrastructure questions. I wanted to know what the system call code was doing before my driver functions were called. My first reaction was to grep for sys_aio_read to find the kernel side of the system call.

No such function exists. My assumption that aio_read(3) and aio_write(3) were system calls was wrong. This led to several questions:

  • What do the userspace aio_read(3) and aio_write(3) functions actually do?
  • How are the kernel driver aio_write and aio_read file operations called?

GNU libc

It turns out that GNU libc implements aio_read(3) and aio_write(3) itself: these functions are a purely userland implementation of asynchronous I/O. When aio_read(3) or aio_write(3) is first called on an open file descriptor, libc creates a new thread and adds the read or write request to that thread's request queue. Subsequent requests also go into the request queue. Requests are processed in priority order; requests of equal priority are handled in the order they were submitted.

The newly created threads simply call blocking read(2) or write(2). This means that kernel-side asynchronous I/O is bypassed entirely: if a character device declares an aio_read or aio_write file operation, those file operations are never called.
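
For reference, here is a minimal sketch of what using the libc implementation looks like from an application. The file path is arbitrary, and a real program would use aio_suspend() or a signal rather than polling; link with -lrt:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[4096];
        struct aiocb cb;

        int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;

        /* Queue the read; glibc hands it to one of its worker threads. */
        if (aio_read(&cb) < 0) {
            perror("aio_read");
            return 1;
        }

        /* The main thread is free to do other work here... */

        /* Poll for completion of the request. */
        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);

        printf("read %zd bytes\n", aio_return(&cb));
        close(fd);
        return 0;
    }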

libaio

To access kernel-side asynchronous I/O, userspace programs need to link against libaio. This library provides thin wrappers for the Linux-specific asynchronous I/O system calls:

  • io_setup
  • io_submit
  • io_destroy
  • io_cancel
  • io_getevents

These system calls, implemented in fs/aio.c, provide the kernel-side asynchronous I/O support; they are what actually call a driver's aio_read and aio_write file operations. The system calls needed distinct names so that they would not clash with the GNU libc aio_read(3) and aio_write(3) function calls.
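
Here is a minimal sketch of submitting a single read through these system calls via the libaio wrappers. The file path is arbitrary; compile with -laio. Note that the libaio wrappers return negative error codes rather than setting errno:

    #include <fcntl.h>
    #include <libaio.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        io_context_t ctx;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event events[1];
        static char buf[4096];

        int fd = open("/etc/hostname", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        memset(&ctx, 0, sizeof(ctx));
        if (io_setup(8, &ctx) < 0) {               /* create an aio context */
            fprintf(stderr, "io_setup failed\n");
            return 1;
        }

        io_prep_pread(&cb, fd, buf, sizeof(buf), 0);  /* describe the read */

        if (io_submit(ctx, 1, cbs) != 1) {         /* hand the request to the kernel */
            fprintf(stderr, "io_submit failed\n");
            return 1;
        }

        /* Wait for at least one completion event. */
        if (io_getevents(ctx, 1, 1, events, NULL) == 1)
            printf("read returned %ld bytes\n", (long)events[0].res);

        io_destroy(ctx);
        close(fd);
        return 0;
    }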

The kernel-side aio implementation is meant to be truly asynchronous. A driver's aio_read() and aio_write() are expected to return immediately after queuing the request. When an interrupt in the driver signals that the transfer is complete, the driver calls kick_iocb(); this guarantees that the driver's retry function (read_retry() in this example) is called in the context of the process that made the original system call. The retry function then copies the data into userspace and signals to the aio subsystem that the transaction is complete by calling aio_complete().
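
To make the flow concrete, here is a rough, hypothetical sketch of that pattern against the 2.6.20-era aio_read file operation signature; my_start_transfer() and my_transfer_done() stand in for real driver code and are not part of any kernel API:

    #include <linux/aio.h>
    #include <linux/errno.h>
    #include <linux/fs.h>
    #include <linux/uio.h>

    /* Hypothetical routine that starts the hardware transfer for this iocb. */
    extern void my_start_transfer(struct kiocb *iocb, const struct iovec *iov,
                                  unsigned long nr_segs, loff_t pos);

    static ssize_t my_aio_read(struct kiocb *iocb, const struct iovec *iov,
                               unsigned long nr_segs, loff_t pos)
    {
            /* Kick off the transfer but do not wait for it to finish. */
            my_start_transfer(iocb, iov, nr_segs, pos);

            /* Tell fs/aio.c the request is queued and will complete later. */
            return -EIOCBQUEUED;
    }

    /* Hypothetical completion path, run once the data has been copied to
     * the user buffer: report the byte count back to the aio core. */
    static void my_transfer_done(struct kiocb *iocb, ssize_t copied)
    {
            aio_complete(iocb, copied, 0);
    }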

Truly asynchronous behavior

One would hope that doing asynchronous I/O in the kernel would be more efficient than GNU libc's thread-based approach. There has been some discussion about the subject; however, that discussion is moot unless kernel drivers structure their aio_read and aio_write functions properly.

When I was searching for an asynchronous driver to use as an example, I found tons of aio_read and aio_write file operations that weren't asynchronous. They would simply initiate their transaction and then wait for it to finish, like a simple read file operation.

A truly asynchronous implementation of the aio_read and aio_write function calls would call kick_iocb() or aio_complete() somewhere in the driver. By doing grep -r 'kick_iocb\|aio_complete' in the 2.6.20-rc4 kernel source tree, I came up with the following files:

  • drivers/usb/gadget/inode.c - the gadgetfs driver, used on USB devices (gadgets) that run Linux.
  • fs/aio.c - the file that handles the aio syscalls and calls down into the driver's file operations table.
  • fs/block_dev.c - the block device file operations.
  • fs/direct-io.c - generic O_DIRECT support used by block-based filesystems.
  • fs/nfs/direct.c - O_DIRECT support for NFS.

In the case of NFS and block devices, you only get asynchronous behavior if the file descriptor has the O_DIRECT flag set. This asks the kernel to attempt to bypass the cache and write into or read from the userspace buffer directly. As the man page for open says, "In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching." Bottom line: the only useful asynchronous I/O implementation in the kernel is in gadgetfs.
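
For illustration, opening a file for O_DIRECT I/O looks roughly like the sketch below. The path is arbitrary, and the 4096-byte alignment is a common requirement rather than a universal rule; the real constraint depends on the underlying device and filesystem:

    #define _GNU_SOURCE            /* needed for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;

        /* O_DIRECT transfers generally require block-aligned buffers. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Reads submitted on this fd with io_submit() (see the libaio
         * example above) bypass the page cache and can complete
         * asynchronously. */

        close(fd);
        free(buf);
        return 0;
    }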

Fibrils

Another way of dealing with asynchronous I/O in the kernel is being discussed on the linux-kernel mailing list. The idea is to create a thread with a tiny stack whenever an I/O operation blocks; this light-weight thread is called a fibril. When a fibril blocks, the scheduler is invoked, and the userland program that made the syscall has a chance to be scheduled again, so it can continue with other work while it waits for the I/O operation to complete.

In some ways, this sounds suspiciously similar to GNU libc's userland aio implementation. There may be some performance gain to creating threads in the kernel rather than in userspace, but the kernel developers are still deciding on the details of fibrils.

The benefit of fibrils is that device drivers can write simple blocking code and the kernel will turn that into asynchronous I/O. If fibrils really catch on, I think that libaio and the kernel aio system calls will fall out of use.

Conclusion

Few people truly understand asynchronous I/O, and documentation on the subject is sparse. The beginning userspace application writer will probably use the GNU libc aio_read(3) and aio_write(3) functions without realizing that they are not using the kernel-side asynchronous I/O implementation.

The few kernel drivers that implement aio_read and aio_write file operations in a truly asynchronous manner are rarely used and may not be well tested. Until kernel aio is shown to provide a real performance gain over GNU libc's implementation, I would suggest that kernel driver writers stick to blocking code. The implementation is simpler, and userspace applications that want asynchronous calls can use GNU libc.