The Waker Allocation Problem
— 2025-05-23
- introduction
- efficient concurrent execution requires intermediate wakers
- intermediate wakers require allocations
- conclusion
Introduction
The entire point of futures and Rust’s async/.await system is to introduce two new capabilities: arbitrary concurrent execution of computations, and arbitrary cancellation of computations. The difference between blocking and non-blocking computation is unimportant if we then don’t also capitalize on it by scheduling work concurrently.
But there’s a problem: in order to schedule work concurrently, futures are left with two bad options for organizing their internals:
- Without allocations futures have O(N^2) (quadratic) wakeup behavior for their child futures. This means that if we schedule 1_000 futures that all wake independently and in sequence, we’ll approximate 1_000_000 total wakes [1].
- With allocations futures can have O(N) (linear) wakeup behavior for their child futures. This means that if we schedule 1_000 futures that all complete independently and in sequence, we’ll have 1_000 total wakes.
In Rust we’re well aware that hidden allocations are bad. But we also know that quadratic scaling on potentially unbounded numbers of items is bad too. And it’s bad that we’re in a position where we have to choose between the two.
Efficient concurrent execution requires intermediate wakers
Say we’re writing a manual future::join function which takes an array of impl Future and returns an array of Future::Output. We can model the internal state of this using an array of futures, and an array of future outputs:
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// A future that concurrently awaits a list of futures
struct Join<Fut: Future, const N: usize> {
    /// The futures we're awaiting.
    futures: [Fut; N],
    /// The outputs of the futures.
    outputs: [Option<Fut::Output>; N],
}
In order to concurrently await all futures, we have to implement the Future trait for Join. The output type of this is [Fut::Output; N], with each call to poll iterating over each pending future contained within. Here’s a naive example implementation, mostly just to provide context:
impl<Fut: Future, const N: usize> Future for Join<Fut, N> {
    type Output = [Fut::Output; N];

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Set up the initial state.
        let this = unsafe { self.get_unchecked_mut() };
        let mut all_done = true;

        // Iterate over every future, polling any that are
        // still pending.
        for i in 0..this.futures.len() {
            // Either the future is already done…
            if this.outputs[i].is_some() {
                continue;
            }
            // …or we still need to poll it.
            let fut = unsafe { Pin::new_unchecked(&mut this.futures[i]) };
            match fut.poll(cx) {
                Poll::Ready(output) => this.outputs[i] = Some(output),
                Poll::Pending => all_done = false,
            }
        }

        // The loop is done and it's time to look at our
        // results. We either still have futures pending…
        if !all_done {
            return Poll::Pending;
        }

        // …or we're all done and we're ready to return. Note that
        // `[None; N]` would require `Fut::Output: Copy`, so we build
        // the replacement array element-by-element instead.
        let outputs = std::mem::replace(&mut this.outputs, std::array::from_fn(|_| None))
            .map(Option::unwrap);
        Poll::Ready(outputs)
    }
}
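To see how this could be used, here’s a hypothetical example of my own. std::future::ready gives us several futures that share one concrete type (which an array requires), and block_on is the thread-parking executor we’ll define later in this post:

use std::future::ready;

fn main() {
    // Three already-completed futures of one concrete type.
    let join = Join {
        futures: [ready(1u32), ready(2), ready(3)],
        outputs: std::array::from_fn(|_| None),
    };
    // All three complete on the first poll, so this returns
    // immediately with their outputs in order.
    let outputs = block_on(join);
    assert_eq!(outputs, [1, 2, 3]);
}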
If you look at the core of this loop, you’ll notice that we’re iterating over each future on every call to poll - regardless of whether it called its respective waker or not. The only futures we don’t poll are the ones we already have an output for. Practically this means that we approach O(N^2) execution: we wake every future on every call to poll. And we do this because we don’t know which futures are ready and which are not.
The solution here is to create an “intermediate” or “inline” waker. Rather than passing the caller’s waker down into all of the child futures, the Join future should instead create its own waker(s) which can track which child futures are ready - and only poll those. The simplest way of doing this is keeping an array of wakers and an array of “ready” flags, one per child future. Whenever a child future’s waker is called, it marks its own index as ready, and then calls the parent waker. Then when the Join future is polled, all it needs to do is iterate over the ready indexes and poll only those futures. For brevity I won’t bother showing the full implementation, just the definition itself:
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
use std::task::{Wake, Waker};

/// A future that concurrently awaits a list of futures
struct Join<Fut: Future, const N: usize> {
    /// The futures we're awaiting.
    futures: [Fut; N],
    /// The outputs of the futures.
    outputs: [Option<Fut::Output>; N],
    /// Keeps a waker per child future
    wakers: [ChildWaker<N>; N],
    /// Tracks which child futures should be woken
    indexes: Arc<[AtomicBool; N]>,
}

/// A waker associated with a specific child future
struct ChildWaker<const N: usize> {
    /// The index of the child future in the index list
    index: usize,
    /// A handle to the parent waker, passed to the struct
    parent_waker: Waker,
    /// Tracks which child futures should be woken. Intended
    /// to be indexed into using `index` inside `Wake`.
    indexes: Arc<[AtomicBool; N]>,
}

impl<const N: usize> Wake for ChildWaker<N> { ... }
There are more efficient schemes than this one, though this is among the simplest. And crucially: this allows us to only ever wake the futures that need to be woken. Which is important, since waking futures may potentially be expensive. And while we can encourage futures to be cheap to wake spuriously - we should really just not cause spurious wakes in the first place. Because it's better if the burden falls on async experts writing primitives, than on every single person implementing the future trait [2].
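To make this concrete, here’s a minimal sketch of what the elided Wake impl above could look like - my own filling-in of the scheme just described, not the exact futures-concurrency implementation:

use std::sync::atomic::Ordering;

impl<const N: usize> Wake for ChildWaker<N> {
    fn wake(self: Arc<Self>) {
        // Mark this child future as ready…
        self.indexes[self.index].store(true, Ordering::SeqCst);
        // …then call the parent waker so the executor knows to
        // poll the `Join` future again.
        self.parent_waker.wake_by_ref();
    }
}

When Join is next polled, it swaps each flag back to false and polls only the children whose flags were set.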
Intermediate wakers require allocations
Now that we’ve looked at why futures want to make use of intermediate wakers, let’s take a closer look at the waker internals. The Future trait relies on a type called Waker, which you can think of as an Arc<Fn> pointer - where Arc is a fancy kind of Box<T> that can be freely cloned and provides shared (immutable) access to its internals. Here’s how you define a new Waker that will unpark the current thread when called:
use std::sync::Arc;
use std::task::{Context, Wake};
use std::thread::{self, Thread};
use std::pin::pin;
/// A waker that wakes up the current thread when called.
struct ThreadWaker(Thread);
impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}
And here is how you construct an instance of that same waker. This relies on the impl<W: Wake + Send + Sync + 'static> From<Arc<W>> for Waker in the stdlib:
/// Run a future to completion on the current thread.
fn block_on<T>(fut: impl Future<Output = T>) -> T {
    let mut fut = pin!(fut);

    // Construct an instance of `ThreadWaker`
    let t = thread::current();
    let waker = Arc::new(ThreadWaker(t)).into(); // ← This allocates

    // ...
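For context, the elided remainder follows the usual pattern from the stdlib’s Wake docs: build a Context from the waker, then poll in a loop, parking the thread whenever the future is pending. A sketch of that tail, assuming Poll is also imported from std::task:

    // …continuing inside `block_on`:
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            Poll::Pending => thread::park(),
        }
    }
}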
Right there, the call to Arc::new means we’re allocating. In theory we don’t have to use Arc for this either. The Waker type only requires that it wraps a RawWakerVTable in a way that is Send + Sync + Clone. That means in theory we could also have a static somewhere with internal mutability, and perhaps some (embedded) systems rely on this for their global runtimes. But in practice, for inline concurrent operations, if we’re going to be creating a Waker it’s going to need to be based on an Arc.
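As a sketch of that no-allocation route, here’s the classic vtable-in-a-static construction. This particular waker does nothing when woken - a real runtime would set a flag or nudge a scheduler instead - but nothing here allocates:

use std::task::{RawWaker, RawWakerVTable, Waker};

// The vtable lives in static memory, so cloning and waking
// this waker never allocate.
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(
    |data| RawWaker::new(data, &NOOP_VTABLE), // clone
    |_| {},                                   // wake
    |_| {},                                   // wake_by_ref
    |_| {},                                   // drop
);

fn noop_waker() -> Waker {
    // Safety: every vtable function trivially upholds the
    // `RawWaker` contract, since none of them do anything.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) }
}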
Conclusion
The fact that Waker practically relies on allocations is not the only problem: in single-threaded systems Waker still requires the + Send + Sync bounds. Even if we were to add a LocalWaker type, that still requires the original Waker type to be present. Which means we can’t practically write portable async-based !Send libraries, because that bound will never surface in the type system [3].
The reason why we have an Arc<Fn> to begin with is because the future design assumes we might want to wrap arbitrary computation in the wakers. But I’m not sure how true that is. For example, in the futures-concurrency library our wakers only ever perform an operation equivalent to flipping an atomic bool (src).
If we were hand-rolling our own implementation directly against e.g. the epoll API, we wouldn’t have to pay any cost like this. What we’ve looked at in this post is purely an artifact of the Future trait’s design.
Having to choose between algorithmic complexity and allocations is also the reason why I haven’t proposed upstreaming the futures-concurrency APIs to the stdlib yet. The Rust stdlib has a design principle that disallows hidden allocations. But at work we also know we need to guard against API misuse. That means we’re likely going to hit cases where using the stdlib APIs is actually not recommended. And it’s hard to justify upstreaming something we might immediately want to guard against using.
This is also not the only problem with the Future trait design. I’ve spoken at length about the problems the related Pin API has. And in this post I’ve also touched on the !Send Waker problem. By my count the Future trait has seven problems like this, which I’ll try to describe in future posts.
1. To be fair - probably about half that, since futures which have completed aren’t polled again.
2. This is not just a hypothetical - it’s an issue we’ve had at Microsoft. We can’t reasonably expect that every person implementing the future trait is an async expert. And so we need primitives which don’t just pass responsibility for performance to individual users.
3. I hope I don’t have to explain that having Send expressed in the type system is a good thing.