Updated 15/6/2008: Added proper introduction, general cleanups,
made the problems with POSIX AIO clearer.
Updated 3/7/2008: Select doesn't have a limit on the number of file
descriptors it can handle. (What was I thinking?)
"Asynchronous I/O" essentially refers to the ability of a process to perform input/output on multiple sources at one time. More specifically it's about doing I/O when data is actually available (in the case of input) or when output buffers are no longer full, rather than just performing a read/write operation and blocking as a result. This in itself is not so difficult, but typically there are several channels through which I/O must be performed and the key is to monitor these multiple channels simultaneously.
Consider the case of a web server with multiple clients connected. There is one network (socket) channel and probably also one file channel for each client (the files must be read, and the data must be passed to the client over the network). One problem is how to determine which client socket to send information to next: if we send on a channel whose output buffer is full, we will block pointlessly and delay sending information to other clients.
In general Asynchronous I/O revolves around two functions: The ability to determine that input or output is immediately possible (in the case of a network connection, terminal, or certain other devices) or that a pending I/O operation has completed. Both cases are examples of asynchronous events, that is, they can happen at any time during program execution, and the process need not actually be waiting for it to happen (though it can do so). The distinction between the two is largely a matter of operating mode (it is the difference between performing a read operation, for example, and being notified when the data is in the application's buffer, compared to simply being notified when the data is available and asking that it be copied to the application's buffer afterwards). Note however that the first case is arguably preferable since it potentially avoids a redundant copy operation.
However I/O isn't the only thing that can happen asynchronously; unix signals can arrive, mutexes can be acquired/released, file locks can be obtained, sync() calls might complete, etc. All these things are also asynchronous events that may need to be dealt with.
There are several ways to deal with asynchronous events on linux; all of them presently have at least some minor problems, mainly due to limitations in the kernel.
The use of multiple threads is in some ways an ideal solution to the problem of asynchronous I/O, as well as asynchronous event handling in general, though it has significant problems for practical application due to the fact that each thread requires a stack (and therefore consumes a certain amount of memory) and the number of threads in a process may be limited by this and other factors. Thus, it may be impractical to assign one thread to each event of interest.
However, threading is presently the only way to deal with certain kinds of asynchronous operation (obtaining file locks, for example; the "fcntl" function either returns immediately with success or failure, or blocks until the lock is obtained or the call is interrupted by a signal - there is no ability to initiate the locking and be notified later when it completes).
Threading can potentially be combined with other types of asynchronous event handling, to allow asynchronous operations where it is otherwise impossible (file locks etc).
Signals can be sent between unix processes by using kill()
as documented in the libc manual, or between threads using pthread_kill().
There are also the so-called "real-time" signal interfaces
described here. Most importantly, signals
can be sent automatically when certain asynchronous events occur; the
details are discussed later - for now it's important to understand how
signals need to be handled.
Signal handlers as an asynchronous event notification mechanism work reasonably well, but because they are truly executed asynchronously there is a limit to what they can actually do (there are a limited number of C library functions which can be called safely from within a signal handler, for instance). A typical signal handler, therefore, often simply sets a flag which the program tests at prudent times during its normal execution.
The previous problem, in fact, alludes to the real issue with asynchronous
event processing: generally, although the events themselves are asynchronous,
we want to deal with them one at a time in a synchronous manner. With
signals this is possible using the flag technique already described, which
can be combined with sigsuspend() (and variants) to allow
waiting for such events without going into a busy flag-polling loop. For
waiting on signals and other asynchronous event types at the same time,
other solutions are required.
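The flag technique combined with sigsuspend() might look something like the following. This is only a minimal sketch: the names event_signal_init() and event_signal_wait() are made up for illustration, and it assumes a single-threaded program that keeps the signal blocked except while it is actually waiting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Set by the handler, tested by the main program. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signum)
{
    (void)signum;
    got_signal = 1;              /* the only thing the handler does */
}

/* Install the handler and block the signal, so that it can only be
 * delivered while we are inside sigsuspend(). Returns 0 or -1. */
int event_signal_init(int signum)
{
    struct sigaction sa;
    sigset_t block;

    assert(signum > 0);
    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, signum);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Sleep until the flag has been set, then clear it. */
void event_signal_wait(int signum)
{
    sigset_t waitmask;

    /* Fetch the current mask and remove signum from the copy; the copy
     * applies only for the duration of each sigsuspend() call. */
    sigprocmask(SIG_BLOCK, NULL, &waitmask);
    sigdelset(&waitmask, signum);

    while (!got_signal)
        sigsuspend(&waitmask);   /* returns after a handler has run */
    got_signal = 0;
}
```

Because the signal is blocked everywhere except inside sigsuspend(), there is no window between testing the flag and going to sleep in which the signal could slip past unnoticed.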
It is possible to open a file (or device) in "non-blocking" mode by using the O_NONBLOCK option in the call to open. You can also set non-blocking mode on an already open file using the fcntl call. Both of these options are documented in the GNU libc documentation.
The result of opening a file in non-blocking mode is that calls to read() and write() will return with an error if they are unable to proceed immediately, ie. if there is no data available to read (yet) or the write buffer is full. This makes it possible to continuously loop through the interesting file descriptors and check for available input (or check for readiness for output) simply by attempting a read (or write). This technique is called polling and is problematic primarily because it needlessly consumes CPU time - that is, the program never blocks, even when no input or output is possible on any file descriptor.
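As a rough illustration, setting non-blocking mode on an open descriptor and making a single polling attempt could be written as below. The helper names set_nonblocking() and try_read(), and the -2 return convention, are my own inventions for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch an already-open descriptor to non-blocking mode.
 * Returns 0 on success, -1 on error (errno set by fcntl). */
int set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* One polling attempt. Returns the number of bytes read (0 means
 * end-of-file), -1 on a real error, or -2 if no data is available yet. */
long try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return (long)n;
}
```

A polling loop would call try_read() on each descriptor in turn and go round again whenever everything returns -2, which is exactly where the CPU time gets burned.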
A more subtle problem with non-blocking I/O is that it generally doesn't work with regular files (this is true on linux, even when files are opened with O_DIRECT; possibly not on other operating systems). That is, opening a regular file in non-blocking mode has no effect: a read will always actually read some of the file, even if the program blocks in order to do so. In some cases this may not be important, seeing as file I/O is generally fast enough so as to not cause long blocking periods (so long as the file is local and not on a network, or a slow medium). However, it is a general weakness of the technique.
(Note, on the other hand, I'm not necessarily advocating that non-blocking I/O of this kind should actually be possible on regular files. The paradigm itself is flawed in this case; why should data ever be made available to read, for instance, unless there is a definite request for it and somewhere to put it? The non-blocking read itself does not serve as such a request, when considered for what it really is: two separate operations, the first being "check whether data is available" and the second being "read it if so").
Note that O_NONBLOCK also causes the open() call itself to be non-blocking for certain types of device (modems are the primary example in the GNU libc documentation). Unfortunately, there doesn't seem to exist a mechanism by which you can execute an open() call in a truly non-blocking manner for regular files (which again, might be particularly desirable for files on a network). The only solution here is to use threads, one for each simultaneous open() operation.
It's clear that, even if non-blocking I/O were usable with regular files, it would only go part-way to solving the asynchronous I/O problem; it provides a mechanism to poll a file descriptor for data, but no mechanism for asynchronous notification of when data is available. To deal with multiple file descriptors a program would need to poll them in a loop, which is wasteful of processor time. The select() function, discussed in the next section, helps to solve this problem.
The select() function is documented in the libc manual. As
noted, a file descriptor for a regular file is considered ready for reading if
it's not at end-of-file and is always considered ready for writing (the man
page for select in the Linux manpages neglects to mention both these facts).
As with non-blocking I/O, select is no solution for regular files (which may
be on a network or slow media).
While select() is
interruptible by signals, it is not trivial to use plain select()
to wait for both signal and I/O readiness events when using the typical signal
technique of having the signal handler simply set a flag. The flag can be
checked before the select() call, and also if the call returns with EINTR
(interrupted
by a signal); however, this leaves a small window of time between the checking
of the flag and the select() call being entered where the signal can occur.
In this case, the signal will not be detected until after select() returns
from an I/O readiness event or another signal (or a timeout), which may cause
the process to block needlessly.
The pselect() call (not documented in the GNU libc manual) allows
atomically unmasking a signal and performing a select() operation (the signal
mask is also restored before pselect returns); this allows waiting for one
of either a specific signal or an I/O readiness event. It is possible to
achieve the same thing without using pselect() by having the signal handler
generate an I/O readiness event that the select() call will notice (for
instance by writing a byte to a pipe, thereby making data available on the
other end).
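A sketch of the pselect() approach follows. The names watch_signal() and wait_fd_or_signal() are hypothetical, and the code assumes a single-threaded program that watches one descriptor for input; the signal is kept blocked at all times except while pselect() is actually sleeping.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static volatile sig_atomic_t sig_flag = 0;

static void flag_handler(int signum)
{
    (void)signum;
    sig_flag = 1;
}

/* Block signum and install the flag-setting handler. Returns 0 or -1. */
int watch_signal(int signum)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = flag_handler;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signum, &sa, NULL) == -1)
        return -1;
    sigemptyset(&block);
    sigaddset(&block, signum);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Wait until fd is readable or signum arrives.
 * Returns 1 if fd is readable, 0 if the signal arrived, -1 on error. */
int wait_fd_or_signal(int fd, int signum)
{
    sigset_t during_wait;
    fd_set readfds;
    int r;

    assert(fd >= 0 && fd < FD_SETSIZE);

    /* The mask that applies only while pselect() sleeps: the current
     * mask with signum removed. */
    sigprocmask(SIG_BLOCK, NULL, &during_wait);
    sigdelset(&during_wait, signum);

    for (;;) {
        if (sig_flag) {          /* safe to test: signum is blocked here */
            sig_flag = 0;
            return 0;
        }
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        r = pselect(fd + 1, &readfds, NULL, NULL, NULL, &during_wait);
        if (r > 0)
            return 1;
        if (r == -1 && errno != EINTR)
            return -1;
        /* EINTR: some handler ran; go round and test the flag. */
    }
}
```

The flag test and the wait cannot be separated by the signal, because the signal can only be delivered once pselect() has swapped the mask in.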
Finally, select (and pselect) aren't particularly good from a performance
standpoint because of the way the file descriptor sets are passed in (as a
bitmask). The kernel is forced to scan the mask up to the supplied nfds
argument in order to check which descriptors the userspace process is actually
interested in. The poll() function, not documented in the
GNU libc manual, is an alternative to select() which uses a
variable sized array to hold the relevant file descriptors instead of a fixed
size structure. This removes the limit on the number of file descriptors.
#include <sys/poll.h>

int poll(struct pollfd *ufds, unsigned int nfds, int timeout);

The structure struct pollfd is defined as:

struct pollfd {
    int fd;         // the relevant file descriptor
    short events;   // events we are interested in
    short revents;  // events which occur will be marked here
};
The events and revents are bitmasks with a combination
of any of the following values:
POLLIN - there is data available to be read
POLLPRI - there is urgent data to read
POLLOUT - writing now will not block
If the feature test macros are set for XOpen, the following are also available. Although they have different bit values, the meanings are essentially the same:
POLLRDNORM - data is available to be read
POLLRDBAND - there is urgent data to read
POLLWRNORM - writing now will not block
POLLWRBAND - writing now will not block
Just to be clear on this, when it is possible to write to an fd without blocking, all three of POLLOUT, POLLWRNORM and POLLWRBAND will be generated. There is no functional distinction between these values.
The following is also enabled for GNU source:
POLLMSG - a system message is available; this is used for
dnotify and possibly other functions. If POLLMSG is set then POLLIN and
POLLRDNORM will also be set.
... However, the Linux man page for poll() states that Linux "knows about but does not use" POLLMSG.
The following additional values are not useful in events but may
be returned in revents, i.e. they are implicitly polled:
POLLERR - an error condition has occurred
POLLHUP - hangup or disconnection of communications link
POLLNVAL - file descriptor is not open
The nfds argument should provide the size of the ufds
array, and the timeout is specified in milliseconds.
The return from poll() is the number of file descriptors for
which a watched event occurred (that is, an event which was set in the
events field in the struct pollfd structure, or
which was one of POLLERR, POLLHUP or POLLNVAL).
The return may be 0 if the timeout was reached. The return is -1 if an error
occurred, in which case errno will be set to one of the following:
EBADF - a bad file descriptor was given
ENOMEM - there was not enough memory to allocate file descriptor tables, necessary for poll() to function.
EFAULT - the specified array was not contained in the calling process's address space.
EINTR - a signal was received while waiting for events.
EINVAL - if the nfds is ridiculously large, that is, larger than the number of fds the process is allowed to have open. Note that this implies it may be unwise to add the same fd to the listen set twice.
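To make the calling convention concrete, here is a small sketch which waits for input on either of two descriptors. The function name wait_either() and its return convention are invented for the example; note also that restarting after EINTR, as done here, restarts the full timeout, which a real program may not want.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait up to timeout_ms milliseconds for either descriptor to become
 * readable. Returns the index (0 or 1) of the first descriptor with an
 * event, -1 on timeout, or -2 on error. */
int wait_either(int fd_a, int fd_b, int timeout_ms)
{
    struct pollfd fds[2];
    int i, r;

    assert(fd_a >= 0 && fd_b >= 0);
    fds[0].fd = fd_a;
    fds[1].fd = fd_b;
    fds[0].events = fds[1].events = POLLIN;

    do {
        r = poll(fds, 2, timeout_ms);
    } while (r == -1 && errno == EINTR);   /* restart if interrupted */

    if (r == -1)
        return -2;
    if (r == 0)
        return -1;

    /* The process must scan the array to find the active entries.
     * POLLERR, POLLHUP and POLLNVAL may be set in revents even though
     * they were never requested in events. */
    for (i = 0; i < 2; i++) {
        if (fds[i].revents & (POLLIN | POLLERR | POLLHUP | POLLNVAL))
            return i;
    }
    return -2;
}
```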
Note that poll() exhibits the same problems in waiting for
signals that select() does. There is a ppoll()
function in more recent kernels (2.6.16+) which changes the timeout argument
to a struct timespec * and which adds a sigset_t *
argument to take the desired signal mask during the wait (this function
is documented in the Linux man pages).
The poll call is inefficient for large numbers of file descriptors, because the kernel must scan the list provided by the process each time poll is called, and the process must scan the list to determine which descriptors were active. Also, poll exhibits the same problems in dealing with regular files as select() does (files are considered always ready for reading, except at end-of-file, and always ready for writing).
It might seem that poll()/select() make non-blocking I/O mode irrelevant (seeing as it's possible to check if immediate I/O is possible); however, it's worth using non-blocking mode for added safety in the event of a spurious readiness event (apparently these can occur, for instance, when data arrives over a network interface but is discarded due to checksum mismatch).
File descriptors can be set to generate a signal when an I/O readiness
event occurs on them - except for those which refer to regular files (which
should not be surprising by now). This allows using sleep(),
pause() or sigsuspend() to wait for both signals
and I/O readiness events, rather than using select()/poll().
The GNU libc documentation has some information on using SIGIO. It tells how
you can use the F_SETOWN argument to fcntl() in
order to specify which process should receive the SIGIO signal for a given
file descriptor. However, it does not mention that on linux you can also use
fcntl() with F_SETSIG to specify an alternative
signal, including a realtime signal. Usage is as
follows:
fcntl(fd, F_SETSIG, signum);
... where fd is the file descriptor and signum is the signal number you want
to use. Setting signum to 0 restores the default behaviour
(send SIGIO). Setting it to non-zero causes the specified signal to be sent
instead when an I/O readiness event occurs. If the specified signal is a
realtime signal, a separate instance is queued for each event, even if the
signal is already pending; if the signal cannot be queued (because the queue
is full), a SIGIO is sent in the traditional manner.
This technique cannot be used with regular files.
If a signal is successfully queued due to an I/O readiness event, additional
signal handler information becomes available to advanced signal handlers
(see the link on realtime signals above for more information). Specifically
the handler will see si_code (in the siginfo_t
structure) with one of the following values:
POLL_IN - data is available
POLL_OUT - output buffers are available (writing will not block)
POLL_MSG - system message available
POLL_ERR - input/output error at device level
POLL_PRI - high priority input available
POLL_HUP - device disconnected
Note these values are not necessarily distinct from other values used by the kernel in sending signals. So it is advisable to use a signal which is used for no other purpose. Assuming that the signal is generated to indicate an I/O event, the following two structure members will be available:
si_band - contains the event bits for the relevant fd, the same as would be seen using poll()
si_fd - contains the relevant fd.
Together with poll(), the signal technique can be used to
reliably wait on a set of events including both I/O readiness events and
signals, even if ppoll()/pselect() are unavailable, and without resorting to
having the signal handler write to a pipe. If you want to go down this route,
be careful to avoid potential race conditions! As always, it is probably
advisable to also use non-blocking I/O. Use sigwaitinfo() to
wait for signals including SIGIO and the chosen realtime I/O signals.
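The following sketch shows roughly how the pieces fit together: F_SETOWN and F_SETSIG select the recipient and the signal, the O_ASYNC flag (set with F_SETFL) turns signal generation on, and the signals are then collected synchronously. The function names are hypothetical, a realtime signal is assumed, and I've used sigtimedwait() rather than sigwaitinfo() purely so that the example can time out instead of hanging.

```c
#define _GNU_SOURCE          /* for F_SETSIG and O_ASYNC */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Ask the kernel to queue signum to this process whenever fd becomes
 * ready. The signal is blocked so that it is collected synchronously
 * rather than by a handler. Returns 0 or -1. */
int enable_io_signal(int fd, int signum)
{
    sigset_t block;
    int flags;

    assert(signum >= SIGRTMIN && signum <= SIGRTMAX);

    sigemptyset(&block);
    sigaddset(&block, signum);
    sigaddset(&block, SIGIO);      /* sent instead if the queue overflows */
    if (sigprocmask(SIG_BLOCK, &block, NULL) == -1)
        return -1;

    if (fcntl(fd, F_SETOWN, getpid()) == -1)   /* who receives the signal */
        return -1;
    if (fcntl(fd, F_SETSIG, signum) == -1)     /* which signal is sent */
        return -1;
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    /* O_ASYNC switches signal generation on; O_NONBLOCK guards against
     * spurious readiness events. */
    return fcntl(fd, F_SETFL, flags | O_ASYNC | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Wait up to timeout_ms for the next I/O signal. Returns the ready fd,
 * -1 on timeout or error, or -2 if plain SIGIO arrived (treat that as
 * an overflow and fall back to poll() on every fd). */
int next_ready_fd(int signum, int timeout_ms)
{
    sigset_t waitset;
    siginfo_t info;
    struct timespec ts;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    sigemptyset(&waitset);
    sigaddset(&waitset, signum);
    sigaddset(&waitset, SIGIO);

    if (sigtimedwait(&waitset, &info, &ts) == -1)
        return -1;
    if (info.si_signo == SIGIO)
        return -2;
    return info.si_fd;   /* info.si_band holds the poll()-style event bits */
}
```

Other signals of interest can simply be added to the set passed to sigtimedwait(), which is what makes this a single wait point for both signals and I/O.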
Note it is possible to assign different signals to different fd's, up to the point that you run out of signals. There is little to be gained from doing so however (it might lead to fewer signal buffer overflows, and so fewer fallbacks to SIGIO, but not by much, seeing as buffers are per-process rather than per-signal. I think).
Note that SIGIO can itself be selected as the notification signal. This allows the associated extra data to be retrieved; however, multiple SIGIO signals will not be queued and there is no way to detect if signals have been lost, so it is necessary to treat each SIGIO as an overflow regardless. It's much better to use a real-time signal. If you do, you potentially have an asynchronous event handling scheme which in some cases may be more efficient than using poll() and even epoll, which will now be discussed.
Epoll is similar to poll(), except that the array of file
descriptors is maintained in the kernel rather than userspace. Syscalls are
available to create a set, add and remove fds from the set, and retrieve
events from the set. This is much more efficient than traditional
poll() as it prevents the linear scanning of the set required at
both the kernel and userspace level for each poll() call.
#include <sys/epoll.h>

int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
epoll_create() is used to create a poll set. The size
argument is an indicator only; it does not limit the number of fds which can
be put into the set. The return value is a file descriptor (used to identify
the set) or -1 if an error occurs (typically ENOMEM,
which indicates there is not enough memory or address space to create the
set in kernel space; EINVAL, EMFILE and ENFILE are also possible). An epoll
file descriptor is deleted by calling close()
and otherwise acts as an I/O file descriptor which has input available if
an event is active on the set.
epoll_ctl is used to add, remove, or otherwise control the
monitoring of an fd in the set denoted by the first argument, epfd.
The op argument specifies the operation which can be any of:
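EPOLL_CTL_ADD - add a file descriptor to the set. The fd
argument specifies the fd to add. The event argument points to
a struct epoll_event structure with the following members:
uint32_t events - a bitmask of the events to monitor on the fd. The values
have the same meaning as the poll() events, though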
they are named with an EPOLL prefix: EPOLLIN,
EPOLLPRI, EPOLLOUT, EPOLLRDNORM,
EPOLLRDBAND, EPOLLWRNORM, EPOLLWRBAND,
EPOLLMSG, EPOLLERR, and EPOLLHUP.
Two additional flags are possible: EPOLLONESHOT, which sets
"One shot" operation for this fd, and EPOLLET, which
sets edge-triggered mode (see the description of epoll_wait for
more information).
In one-shot mode, a file descriptor generates an event only once. After that,
the bitmask for the file descriptor is cleared, meaning that no further events
will be generated unless EPOLL_CTL_MOD is used to re-enable
some events.
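epoll_data_t data - user data associated with the fd, which is returned
with each event on it. This is a union with the following members:
void *ptr; int fd; uint32_t u32; uint64_t u64;
EPOLL_CTL_MOD - modify the settings for an fd which is already in the set.
The arguments are the same as for EPOLL_CTL_ADD.
EPOLL_CTL_DEL - remove an fd from the set. The event argument is ignored.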
The return is 0 on success or -1 on failure, in which case errno
is set to one of the following:
EBADF - the epfd argument is not a valid file descriptor
EPERM - the target fd is not supported by the epoll interface
EINVAL - the epfd argument is not an epoll set descriptor,
or the operation is not supported
ENOMEM - there is insufficient memory or address space to handle
the request
The epoll_wait() call is used to read events from the fd set.
The epfd argument identifies the epoll set to check. The
events argument is a pointer to an array of struct epoll_event
structures (format specified above) which contain both the user data
associated with a file descriptor (as supplied with epoll_ctl())
and the events on the fd. The size of the array is given by the maxevents
argument. The timeout argument specifies the time to wait
for an event, in milliseconds; a value of -1 means to wait indefinitely.
In edge-triggered mode, an event is reported only once for each time the
readiness state changes from inactive to active, that is, from the situation
being absent to being present. If the situation is not removed once the
event is reported by epoll_wait(), it will not be reported again;
this is as opposed to the default triggering mode, "level triggered",
where the readiness state will be reported each time epoll_wait()
is called.
The return is the number of events stored in the array (0 if the timeout
expired with no events), or -1 on failure, in which case errno
is set to one of:
EBADF - the epfd argument is not a valid file descriptor
EINVAL - epfd is not an epoll set descriptor, or maxevents is less than 1
EFAULT - the memory area occupied by the specified array is not accessible with write permissions
EINTR - the call was interrupted by a signal before any event occurred
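Putting the three calls together, a minimal sketch of default (level-triggered) usage might look like this. The helper names watch_for_input() and collect_ready() are made up, and the fixed batch size of 16 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add fd to the epoll set, watching for input. The fd itself is stored
 * as the user data so that it comes back with each event. */
int watch_for_input(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN };

    assert(epfd >= 0 && fd >= 0);
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait up to timeout_ms and store the ready fds in ready[0..max-1].
 * Returns the number stored (0 on timeout) or -1 on error. */
int collect_ready(int epfd, int *ready, int max, int timeout_ms)
{
    struct epoll_event events[16];
    int i, n;

    if (max > 16)
        max = 16;                /* batch size of this sketch */
    do {
        n = epoll_wait(epfd, events, max, timeout_ms);
    } while (n == -1 && errno == EINTR);

    /* Only active descriptors are returned, so there is no scan of
     * the whole set here. */
    for (i = 0; i < n; i++)
        ready[i] = events[i].data.fd;
    return n;
}
```

In a real program the data member would more likely carry a pointer to some per-connection structure rather than the bare fd.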
Note that an epoll set descriptor can be used much like a regular file
descriptor. That is, it can be made to generate SIGIO (or another signal)
when input (i.e. events) is available on it; likewise it can be used with
poll() and can even be stored inside another epoll set.
Epoll is fairly efficient, but it still won't work with regular files. Also, adding/removing fds from a set might perform linearly on the size of the set (depending on the implementation in the kernel).
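Returning briefly to edge-triggered mode: since an event is reported only once per change in readiness, the usual approach (as far as I can tell) is to make the fd non-blocking and read until EAGAIN each time an event arrives. A sketch, with the hypothetical helper names watch_edge_triggered() and drain():

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for input in edge-triggered mode. The fd is also made
 * non-blocking, because the reader must keep reading until EAGAIN.
 * Returns 0 or -1. */
int watch_edge_triggered(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Read (and here simply discard) everything currently available.
 * Returns the total number of bytes read, or -1 on error. */
long drain(int fd)
{
    char buf[256];
    long total = 0;
    ssize_t n;

    assert(fd >= 0);
    for (;;) {
        n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;                    /* end-of-file */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                    /* nothing more for now */
        return -1;
    }
}
```

If the program stops reading before EAGAIN, the remaining data will sit there unreported until more arrives, which is the main trap with this mode.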
The POSIX asynchronous I/O interface, which is documented in the GNU libc manual, would seem to be almost ideal for performing asynchronous I/O. After all, that's what it was designed for. But if you think that this is the case, you're in for bitter disappointment.
The documentation in the GNU libc manual (v2.3.1) is not complete - it doesn't document the "struct sigevent" structure used to control how notification of completed requests is performed. The structure has the following members:
int sigev_notify - can be set to SIGEV_NONE (no
notification), SIGEV_THREAD (a thread is started, executing
function sigev_notify_function), or SIGEV_SIGNAL
(a signal, identified by sigev_signo, is sent). SIGEV_SIGNAL can be combined
with SIGEV_THREAD_ID in which case the signal will be delivered to a specific
thread, rather than the process. The thread is identified by the
_sigev_un._tid member - this is an obviously undocumented
feature and possibly an unstable interface.
Note that in particular, "sigev_value" and "sigev_notify_attributes" are not documented in the libc manual, and the type of each field is left unspecified.
Unfortunately POSIX AIO on linux is implemented at user level, using threads! (Actually, there is an AIO implementation in the kernel. I believe it's been in there since sometime in the 2.5 series. But it may have certain limitations - see here - I've yet to ascertain current status, but I believe it's not complete, and I don't believe Glibc uses it).
But there's a much more significant problem: The POSIX AIO API is totally screwed. The people who came up with it were on drugs or something. Really. I'll go through various issues, starting with the ones that aren't so bad and ending with the real doozies.
lio_listio() is useless. At least, I can't think of any situations
where you'd want to submit a whole bunch of requests at one time.
aio_suspend(), while it might seem to solve the issue of
notification, requires scanning the list of the aiocb structures by the kernel
(to determine whether any of them have completed) and the userspace process
(to find which one completed). That is to say, it has exactly the same
problems as poll().
Also it has the potential signal race problem
discussed previously (which can be worked around by having the signal handler
write to a pipe which is being monitored by the aio_suspend call).
In short, it's a bunch of crap.
... is yet to arrive. I'll examine the state of AIO support in recent kernel versions someday (its API looks, thankfully, a lot better than POSIX AIO, but it may still be lacking a lot of functionality).
There's occasionally talk of trying to improve the situation, but progress has been far, far slower than I'd like.
For the record, though, I think that the real solution:
If I really had my way, heads would roll. Who the hell writes these Posix standards??
-- Davin McCall
Links and references:
Richard Gooch's I/O event handling (2002)
POSIX Asynchronous I/O for Linux - unclear whether this works with recent kernels
Yet to be discussed: eventfd, signalfd, current kernel AIO support, syslets/threadlets, kevents, timers including timerfd and setitimer