The Waker Allocation Problem
— 2025-05-23
- introduction
- efficient concurrent execution requires intermediate wakers
- intermediate wakers require allocations
- conclusion
Introduction
The entire point of futures and Rust’s async/.await system is to introduce two new capabilities: arbitrary concurrent execution of computations, and arbitrary cancellation of computations. The difference between blocking and non-blocking computation is unimportant if we then don’t also capitalize on it by scheduling work concurrently.
But there’s a problem: in order to schedule work concurrently, futures are left with two bad options for organizing their internals:
- Without allocations futures have O(N^2) (quadratic) wakeup behavior for their child futures. This means that if we schedule 1_000 futures that all wake independently and in sequence, we’ll approximate 1_000_000 total wakes [1].
- With allocations futures can have O(N) (linear) wakeup behavior for their child futures. This means that if we schedule 1_000 futures that all complete independently and in sequence, we’ll have 1_000 total wakes.
In Rust we’re well aware that hidden allocations are bad. But we also know that quadratic scaling on potentially unbounded numbers of items is bad too. And it’s bad that we’re in a position where we have to choose between the two.
Efficient concurrent execution requires intermediate wakers
Say we’re writing a manual future::join function which takes an array of impl Future and returns an array of Future::Output. We can model the internal state of this using an array of futures, and an array of future outputs:
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// A future that concurrently awaits a list of futures
struct Join<Fut: Future, const N: usize> {
    /// The futures we're awaiting.
    futures: [Fut; N],
    /// The outputs of the futures.
    outputs: [Option<Fut::Output>; N],
}
In order to concurrently await all futures, we have to implement the Future trait for Join. The output type of this is [Fut::Output; N], with each call to poll iterating over each pending future contained within. Here’s a naive example implementation, mostly just to provide context:
impl<Fut: Future, const N: usize> Future for Join<Fut, N> {
    type Output = [Fut::Output; N];

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Set up the initial state.
        let this = unsafe { self.get_unchecked_mut() };
        let mut all_done = true;

        // Iterate over every future, polling any that are
        // still pending.
        for i in 0..this.futures.len() {
            // Either the future is already done…
            if this.outputs[i].is_some() {
                continue;
            }
            // …or we still need to poll it.
            let fut = unsafe { Pin::new_unchecked(&mut this.futures[i]) };
            match fut.poll(cx) {
                Poll::Ready(output) => this.outputs[i] = Some(output),
                Poll::Pending => all_done = false,
            }
        }

        // The loop is done and it's time to look at our
        // results. We either still have futures pending…
        if !all_done {
            return Poll::Pending;
        }

        // …or we're all done and we're ready to return. Note that
        // `[None; N]` would require `Fut::Output: Copy`, so we build
        // the replacement array element-by-element instead.
        let outputs = std::mem::replace(&mut this.outputs, std::array::from_fn(|_| None))
            .map(Option::unwrap);
        Poll::Ready(outputs)
    }
}
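To see how this could be used, here’s a hypothetical example of my own. std::future::ready gives us several futures that share one concrete type (which an array requires), and block_on is the thread-parking executor we’ll define later in this post:

use std::future::ready;

fn main() {
    // Three already-completed futures of one concrete type.
    let join = Join {
        futures: [ready(1u32), ready(2), ready(3)],
        outputs: std::array::from_fn(|_| None),
    };
    // All three complete on the first poll, so this returns
    // immediately with their outputs in order.
    let outputs = block_on(join);
    assert_eq!(outputs, [1, 2, 3]);
}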
If you look at the core of this loop, you’ll notice that we’re iterating over each future on every call to poll - regardless of whether it called its respective waker or not. The only futures we don’t poll are the ones we already have an output for. Practically this means that we approach O(N^2) execution: we wake every future on every call to poll. And we do this because we don’t know which futures are ready and which are not.
The solution here is to create an “intermediate” or “inline” waker. Rather than passing the caller’s waker down into all of the child futures, the Join future should instead create its own waker(s) which can track which child futures are ready - and only poll those. The simplest way of doing this is keeping an array of wakers and an array of “ready” flags, one per child future. Whenever a child future’s waker is called, it marks its own index as ready, and then calls the parent waker. Then when the Join future is polled, all it needs to do is iterate over the ready indexes and poll only those futures. For brevity I won’t bother showing the full implementation, just the definition itself:
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::task::{Wake, Waker};

/// A future that concurrently awaits a list of futures
struct Join<Fut: Future, const N: usize> {
    /// The futures we're awaiting.
    futures: [Fut; N],
    /// The outputs of the futures.
    outputs: [Option<Fut::Output>; N],
    /// Keeps a waker per child future
    wakers: [ChildWaker<N>; N],
    /// Tracks which child futures should be woken
    indexes: Arc<[AtomicBool; N]>,
}

/// A waker associated with a specific child future
struct ChildWaker<const N: usize> {
    /// The index of the child future in the index list
    index: usize,
    /// A handle to the parent waker, passed to the struct
    parent_waker: Waker,
    /// Tracks which child futures should be woken. Intended
    /// to be indexed into using `index` inside `Wake`.
    indexes: Arc<[AtomicBool; N]>,
}

impl<const N: usize> Wake for ChildWaker<N> { ... }
There are more efficient schemes than this one, though this is among the simplest. And crucially: this allows us to only ever wake the futures that need to be woken. Which is important, since waking futures may potentially be expensive. And while we can encourage futures to be cheap to wake spuriously - we should really just not cause spurious wakes in the first place. Because it's better if the burden falls on async experts writing primitives, than on every single person implementing the future trait [2].
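To make this concrete, here’s a minimal sketch of what the elided Wake impl above could look like - my own filling-in of the scheme just described, not the exact futures-concurrency implementation:

use std::sync::atomic::Ordering;

impl<const N: usize> Wake for ChildWaker<N> {
    fn wake(self: Arc<Self>) {
        // Mark this child future as ready…
        self.indexes[self.index].store(true, Ordering::SeqCst);
        // …then call the parent waker so the executor knows to
        // poll the `Join` future again.
        self.parent_waker.wake_by_ref();
    }
}

When Join is next polled, it swaps each flag back to false and polls only the children whose flags were set.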
Intermediate wakers require allocations
Now that we’ve looked at why futures want to make use of intermediate wakers, let’s take a closer look at the waker internals. The Future trait relies on a type called Waker, which you can think of as an Arc<Fn> pointer - where Arc is a fancy kind of Box<T> that can be freely cloned and provides shared (immutable) access to its internals. Here’s how you define a new Waker that will unpark the current thread when called:
use std::sync::Arc;
use std::task::{Context, Wake};
use std::thread::{self, Thread};
use std::pin::pin;
/// A waker that wakes up the current thread when called.
struct ThreadWaker(Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}
And here is how you construct an instance of that same waker. This relies on the impl<W: Wake + Send + Sync + 'static> From<Arc<W>> for Waker in the stdlib:
/// Run a future to completion on the current thread.
fn block_on<T>(fut: impl Future<Output = T>) -> T {
    let mut fut = pin!(fut);

    // Construct an instance of `ThreadWaker`
    let t = thread::current();
    let waker = Arc::new(ThreadWaker(t)).into(); // ← This allocates

    // ...
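For context, the elided remainder follows the usual pattern from the stdlib’s Wake docs: build a Context from the waker, then poll in a loop, parking the thread whenever the future is pending. A sketch of that tail, assuming Poll is also imported from std::task:

    // …continuing inside `block_on`:
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}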
Right there, the call to Arc::new means we’re allocating. In theory we don’t have to use Arc for this either. The Waker type only requires that it wraps a RawWakerVTable in a way that is Send + Sync + Clone. That means in theory we could also have a static somewhere with internal mutability, and perhaps some (embedded) systems rely on this for their global runtimes. But in practice, for inline concurrent operations, if we’re going to be creating a Waker it’s going to need to be based on an Arc.
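As a sketch of that no-allocation route, here’s the classic vtable-in-a-static construction. This particular waker does nothing when woken - a real runtime would set a flag or nudge a scheduler instead - but nothing here allocates:

use std::task::{RawWaker, RawWakerVTable, Waker};

// The vtable lives in static memory, so cloning and waking
// this waker never allocate.
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(
    |data| RawWaker::new(data, &NOOP_VTABLE), // clone
    |_| {},                                   // wake
    |_| {},                                   // wake_by_ref
    |_| {},                                   // drop
);

fn noop_waker() -> Waker {
    // Safety: every vtable function trivially upholds the
    // `RawWaker` contract, since none of them do anything.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) }
}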
Conclusion
The fact that Waker practically relies on allocations is not the only problem: in single-threaded systems Waker still requires the + Send + Sync bounds. Even if we were to add a LocalWaker type, that still requires the original Waker type to be present. Which means we can’t practically write portable async-based !Send libraries, because that bound will never surface in the type system [3].
The reason why we have an Arc<Fn> to begin with is because the future design assumes we might want to wrap arbitrary computation in the wakers. But I’m not sure how true that is. For example, in the futures-concurrency library our wakers only ever perform an operation equivalent to flipping an atomic bool (src).
If we were hand-rolling our own implementation directly against e.g. the epoll API, we wouldn’t have to pay any cost like this. What we’ve looked at in this post is purely an artifact of the Future trait’s design.
Having to choose between algorithmic complexity and allocations is also the reason why I haven’t proposed upstreaming the futures-concurrency APIs to the stdlib yet. The Rust stdlib has a design principle that disallows hidden allocations. But at work we also know we need to guard against API misuse. That means we’re likely going to hit cases where using the stdlib APIs is actually not recommended. And it’s hard to justify upstreaming something we might immediately want to guard against using.
This is also not the only problem with the Future trait design. I’ve spoken at length about the problems the related Pin API has. And in this post I’ve also touched on the !Send Waker problem. By my count the Future trait has seven problems like this, which I’ll try to describe in future posts.
1. To be fair - probably about half that, since futures which have completed aren’t polled again.
2. This is not just a hypothetical - it’s an issue we’ve had at Microsoft. We can’t reasonably expect that every person implementing the future trait is an async expert. And so we need primitives which don’t just pass responsibility for performance to individual users.
3. I hope I don’t have to explain that having Send expressed in the type system is a good thing.