Pigweed Eng Blog #5: C++20 coroutines without heap allocation#
Published on 2024-10-07 by Taylor Cramer
Pigweed now provides support for coroutines on embedded targets!
Coro<Status> Blink(CoroContext&,
                   TimeProvider<SystemClock>& time,
                   Led& led,
                   int times) {
  for (int i = 0; i < times; ++i) {
    led.TurnOn();
    co_await time.WaitFor(500ms);
    led.TurnOff();
    co_await time.WaitFor(500ms);
  }
  led.TurnOff();
  co_return OkStatus();
}
Notably, Pigweed’s pw::async2::Coro API does not require heap allocation. This post details how we accomplished this, what challenges we encountered, and how you can get started using coroutines on embedded.
If you’re:

a C++ embedded developer looking for an ergonomic way to write async code
a Rust async/await user curious about C++
a C++ committee member who wants to make embedded developers happy

read on!
Why use coroutines for embedded?#
Coroutines allow for multiple functions to run concurrently. Compared to other concurrency options, coroutines have:
No need for thread stacks
No OS/RTOS-dependent context switching
No complex, manually written state machines
The coroutines API introduced in C++20 allows you to write straight-line code that can yield to the caller and resume at some later point. This can be used similarly to async/await from languages like Rust, C#, JavaScript, Python, and Swift.
When a typical blocking function waits for a long-running operation to complete, the whole thread (and any resources associated with it) is blocked until the operation is done. By contrast, coroutine functions can yield to their caller and only resume once an asynchronous operation completes. The caller, often some kind of scheduler, is then free to run other code while the coroutine function waits for its result.
This allows many operations to execute concurrently without paying the performance cost of threading or the cognitive cost of callbacks and manual state machines.
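For contrast, here’s a sketch of what a blocking version of the Blink example might look like. SleepFor is a hypothetical blocking delay; Led, Status, and OkStatus() are the names from the example above.

// Blocking sketch of the same logic. While SleepFor() waits, the whole
// thread (and its stack) is stuck here and cannot do any other work.
Status BlinkBlocking(Led& led, int times) {
  for (int i = 0; i < times; ++i) {
    led.TurnOn();
    SleepFor(500ms);
    led.TurnOff();
    SleepFor(500ms);
  }
  led.TurnOff();
  return OkStatus();
}

The coroutine version at the top of this post expresses the same straight-line logic, but each co_await hands control back to the caller instead of stalling a thread.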
Why C++ coroutines require dynamic allocation#
Unfortunately for embedded developers, the C++ coroutine API uses heap allocation by default. Heap allocation (e.g. new and delete, or malloc and free) isn’t available on many embedded platforms.
Fortunately, heap allocation is not necessary to use C++ coroutines. However, C++20 coroutines do require dynamic allocation (memory that is allocated at runtime). Pigweed’s pw::async2::Coro API allocates memory using a pw::Allocator. This avoids the need for a platform-provided heap allocator, but users may still encounter some of the costs associated with dynamic allocation, such as:

Heap fragmentation
Runtime memory management size and performance overhead
Runtime out-of-memory conditions
The need to “right-size” a heap
This dynamic allocation is done in order to store the “coroutine frame”, an object which holds the coroutine’s state.
The coroutine frame#
Despite not using a native thread stack, C++20 coroutines do need somewhere to store the “stack” of paused functions. For example, in our Blink function from above:
Coro<Status> Blink(CoroContext&,
                   TimeProvider<SystemClock>& time,
                   Led& led,
                   int times) {
  for (int i = 0; i < times; ++i) {
    led.TurnOn();
    co_await time.WaitFor(500ms);
    led.TurnOff();
    co_await time.WaitFor(500ms);
  }
  led.TurnOff();
  co_return OkStatus();
}
When we reach a co_await statement, the function is paused and control is returned to the scheduler until 500ms passes. When paused, we still need to store the function’s state, including the loop variable i, the function arguments, and a pointer to where we were in the function when it paused.
This coroutine state (hereafter referred to as a “coroutine frame”) typically requires much less memory than a full thread stack, as it can allocate exactly the amount of memory the function needs for variables that are “live” across co_await points.
Layout of coroutine frames
The actual structure of this object is a compiler implementation detail, but you can read more about how Clang structures it in The structure of coroutine frames.
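As a purely illustrative sketch (BlinkFrame and PromiseType are made-up placeholder names, and no compiler actually lays the frame out this way), the frame for Blink conceptually needs to hold something like:

// Conceptual sketch only: the real frame layout is compiler-specific and
// not observable from standard C++.
struct BlinkFrame {
  void (*resume)(BlinkFrame*);      // Where to continue when resumed.
  void (*destroy)(BlinkFrame*);     // How to tear the frame down.
  PromiseType promise;              // The promise object.
  TimeProvider<SystemClock>& time;  // Arguments are captured in the frame...
  Led& led;
  int times;
  int i;                            // ...as is the loop variable, since all
  int suspend_point;                // of these are live across co_await points.
};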
Static vs. dynamic allocation of the coroutine frame#
When a coroutine is first created, the C++ coroutine API dynamically allocates space to store the coroutine frame and returns a pointer to it called a coroutine handle. By contrast, Rust’s async functions directly return coroutine frames inline, allowing for static allocation. Why did C++ make the choice to dynamically allocate? Some key reasons are that returning a pointer to dynamically allocated memory keeps the coroutine function’s return value small, movable, and, most importantly, ABI-stable.
ABI stability#
ABI stability is a particular challenge when designing coroutines. The coroutine frame contains the stack of the coroutine function, so the coroutine frame’s size is dependent upon the number and size of the variables used in the function implementation.
For example, two functions with this declaration can have differently sized coroutine frames:
Coro<int> GimmeANumber();
An implementation like this would have a very small coroutine frame:
Coro<int> GimmeANumber() {
co_return 5;
}
While this function would need a very large coroutine frame in order to store its temporary state:
Coro<int> GimmeANumber() {
ABigValue a = co_await DownloadAllOfReddit();
AnotherOne b = co_await DownloadAllOfHackerNews();
co_return a.size() + b.size();
}
If we were to inline the coroutine frame in the resulting Coro<int> object, that would mean that Coro<int> would have to be a different size depending on the function implementation. The two functions would therefore have different ABIs, and we would no longer be able to refer to Coro<int> as a single statically sized type.
This would also make it impossible to call a coroutine function from a different translation unit without knowing the details of its implementation. Without additional compiler features, inlined coroutine frames would require that public coroutine functions be implemented in headers (similar to templated or constexpr functions).
Similarly, inlined coroutine frames would prevent coroutine functions from being virtual, as different overrides would have different ABIs.
Rust bites the bullet#
Rust inlines the coroutine frame and runs headfirst into these challenges. The coroutine-style object returned by a Rust async fn is:
Immovable once the coroutine has started. This is necessary in order to ensure that pointers to variables in the coroutine stack are not invalidated.
Dependent upon the implementation of the async function. Adding a new variable in the body of the async function will require more “stack” space, so the size and structure of the returned state machine will change as the implementation of the async function changes.
Not usable with dynamic dispatch (Rust’s virtual) without an intermediate layer which dynamically allocates the coroutine frame (see the async_trait library for one implementation of this pattern).
Challenging to compile. Since the size of the coroutine object is dependent on the function implementation, code which uses the coroutine function can’t be compiled until the compiler has first determined the size of the coroutine object.
However, this tradeoff also means that Rust can statically allocate space for coroutines, avoiding all the pitfalls of dynamic allocation. The current C++ coroutine API doesn’t allow for this.
What Pigweed provides#
Despite the need for dynamic allocation, it is still possible to use coroutines on embedded devices. Today, Pigweed is using coroutines in the Sense: An interactive tour through Pigweed showcase for the Raspberry Pi RP2040 and (new!) RP2350 microcontrollers.
In order to accomplish this, we needed the Pigweed pw::async2::Coro API to:

Avoid heap allocation.
Allow recovering from allocation failure without using exceptions.
Allow using custom, context-specific pw::Allocator implementations on a per-coroutine basis.
If you want to skip ahead, you can see the Coro implementation here.
Eliminating the heap#
As I said before, the C++ coroutine API defaults to heap allocation. Fortunately for us embedded devs, the coroutine promise_type can customize the behavior of this allocation by implementing the following functions:
static void* operator new(std::size_t, Args...) noexcept
static MyReturnObjectType get_return_object_on_allocation_failure()
static void operator delete(void*)
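Sketched roughly, those hooks live on the promise type like this (MyCoroutineType and everything besides the three functions above are hypothetical placeholders; the sections below walk through each hook):

struct promise_type {
  // Allocates the coroutine frame. The coroutine function's arguments are
  // forwarded here, which is what lets us intercept an allocator (below).
  static void* operator new(std::size_t size, /* coroutine args... */) noexcept;

  // Invoked when operator new returns nullptr; produces an "empty"
  // sentinel return object instead of throwing.
  static MyCoroutineType get_return_object_on_allocation_failure();

  // Invoked when the coroutine frame is destroyed.
  static void operator delete(void* ptr);

  // ...plus the usual get_return_object(), initial_suspend(), and so on.
};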
Allocating memory#
Implementing operator new lets us control how coroutine memory is allocated. Our custom operator new takes two parameters: a std::size_t and an Args... parameter pack. Let’s explore how these are handled.
The size parameter#
Unlike regular operator new or placement new, the promise_type operator new function allocates space not just for the promise_type itself, but also for the compiler-generated coroutine stack. The size of this coroutine stack is determined by the compiler and is optimization-dependent, so it is not statically observable and is only available at runtime.
The Args... parameter pack#
The arguments to coroutine functions are also passed into the operator new of the corresponding promise_type. This allows operator new to “intercept” arguments passed to the coroutine function.
This means that if we have some coroutine function:
MyCoroutineType some_func(int a)
and MyCoroutineType::promise_type
has an operator new
like:
static void* operator new(std::size_t n, int a) noexcept
then the argument a will be copied into the invocation of operator new.
pw::async2::Coro uses this tool to allow the caller of a coroutine function to specify which memory allocator to use for the coroutine frame. Different coroutines may want to use memory provided by different allocators with various memory locations or allocation strategies. When allocating a regular type T, we’d typically achieve this by pre-allocating sizeof(T) with our custom allocator and then passing the resulting memory into the placement new function of T.
However, unlike regular types, coroutine constructors and operator new are not directly available to the caller. Instead, the coroutine types are constructed by the compiler when the coroutine function is called. Furthermore, C++20 doesn’t provide a way to know ahead of time how much memory needs to be allocated to store the coroutine state; that information is only available once our custom operator new is invoked with a size value.
Instead, pw::async2::Coro allows custom per-coroutine Allocator values to be provided by intercepting an Allocator argument to the coroutine function. One can pass an allocator as the first argument of all coroutine functions. We can then use an Args... variadic to discard all other arguments to the coroutine function:
template <typename... Args>
static void* operator new(std::size_t n,
                          Allocator& alloc,
                          const Args&...) noexcept {
  return alloc.Allocate(n);
}
The user’s coroutine functions will then look like this:
MyCoroutineType some_func(Allocator&, int a) { ... }
The operator new implementation will then receive the caller-provided Allocator reference and can use it to allocate the coroutine state.
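A call site then looks roughly like this (a sketch; alloc stands for whatever Allocator instance the caller happens to own):

// The caller decides which allocator backs this particular coroutine's
// frame; different call sites can pass different allocators.
MyCoroutineType coro = some_func(alloc, /*a=*/42);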
CoroContext
The allocator argument to pw::async2::Coro is actually packaged inside a pw::async2::CoroContext in order to allow for greater evolvability, but today this type is a simple wrapper over a pw::Allocator.
Handling allocation failure#
Many embedded systems both disable exceptions and wish to recover gracefully from allocation failure. If this is the case, promise_type::operator new must be marked noexcept and return nullptr on allocation failure.
Note, however, that our coroutine function doesn’t have an obvious way to return nullptr:
MyCoroutineType some_func(Allocator&, int a) { ... }
It must produce a value of type MyCoroutineType.
This is made possible by promise_type::get_return_object_on_allocation_failure. When operator new returns nullptr, C++ will invoke get_return_object_on_allocation_failure to produce an “empty” MyCoroutineType. Coroutine return types must therefore decide on a sentinel representation to use in order to indicate allocation failure.
Fortunately, most coroutine return types are simple wrappers around std::coroutine_handle<my_promise_type>, which is nullable, so we can use the nullptr representation to indicate allocation failure.
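As a sketch of what that can look like on the return-object side (MyCoroutineType, Empty(), and IsValid() are hypothetical names, not Pigweed’s API):

#include <coroutine>

class MyCoroutineType {
 public:
  struct promise_type;  // Declares the allocation hooks described above.

  // Returned by promise_type::get_return_object_on_allocation_failure():
  // a wrapper around a null coroutine handle.
  static MyCoroutineType Empty() { return MyCoroutineType(nullptr); }

  // Callers check this before polling or resuming the coroutine.
  bool IsValid() const { return static_cast<bool>(handle_); }

 private:
  explicit MyCoroutineType(std::coroutine_handle<promise_type> handle)
      : handle_(handle) {}
  std::coroutine_handle<promise_type> handle_;
};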
The delete function is missing critical pieces. What now?#
Just as we can control allocation with our custom promise_type::operator new, we can control deallocation with promise_type::operator delete!
But unlike operator new, delete cannot intercept function arguments, nor can it access properties of the coroutine frame. By the time delete is called, the coroutine frame has already been destroyed. Other parts of C++ use the destroying_delete API to allow accessing an object as part of its deletion, but the coroutine promise_type doesn’t include such an API.
This means that delete has no way to get access to the Allocator instance from which the memory was allocated. Pigweed’s pw::Allocator API does not provide a way to free memory from only a pointer to the allocated memory; a reference to the Allocator object is required (particular Allocator instances may support this, but the generic interface does not). pw::async2::Coro solves this by specifying operator delete to be a no-op. Instead, the coroutine memory is freed by the coroutine handle wrapper type (Coro) after running operator delete:
void* address = promise_handle_.address();
if (address != nullptr) {
  pw::allocator::Deallocator& dealloc = promise_handle_.promise().dealloc_;
  promise_handle_.destroy();
  promise_handle_ = nullptr;
  dealloc.Deallocate(address);
}
Promise handle address
The standard doesn’t appear to clarify whether or not this usage is supported. I couldn’t find any documentation specifying that the promise handle address must point to the start of the allocated memory (rather than, say, the allocated memory plus eight bytes or something).
In practice, this seems to work just fine: the Pigweed API works in both GCC and Clang, even with sanitizers enabled.
In the future, a guarantee here would be super useful!
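How does the promise end up holding the dealloc_ reference read in the snippet above? One way to wire it up is a sketch only: it relies on the C++20 rule that a promise constructor accepting the coroutine function’s arguments is preferred over the default constructor, and Pigweed’s actual internals may differ.

struct promise_type {
  // C++20 constructs the promise with the coroutine function's arguments
  // when a matching constructor exists, so the caller-provided allocator
  // can be captured here for the wrapper type to use when freeing the frame.
  template <typename... Args>
  promise_type(Allocator& alloc, const Args&...) : dealloc_(alloc) {}

  Allocator& dealloc_;  // Read by the wrapper type before deallocating.

  // ...operator new, the no-op operator delete, and the rest of the promise
  // machinery are elided from this sketch...
};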
Still to go: improvements needed from C++ standard and compilers#
One concern that may hamper adoption of coroutines in embedded applications is the total size of coroutine frames. While these frames are frequently much smaller than the stack sizes of threads, there may be many more of them.
Especially for users who would otherwise write manual state-machine-based or callback-based code, the current size overhead of coroutines may be a limiting factor. Some opportunities for improvement here are discussed below.
Reusing stack space of expired temporaries#
Currently, Clang does not reuse the stack space of temporaries after their storage duration ends. This means that all variables and temporaries in a coroutine function must have dedicated storage space in the coroutine frame. This is unnecessary: many variables are not alive at the same time, and others may never be alive across a co_await point! In the former case, storage for these variables could be overlapped, and in the latter, they need not be stored in the coroutine frame at all.
This results in this code having a coroutine frame of 88 bytes:
Coro<Status> SimplestCoroWithAwait(CoroContext& cx) {
co_return co_await SimplestCoro(cx);
}
And this code having a coroutine frame of 408 bytes:
Coro<Status> SimplestCoroWithTwoAwaits(CoroContext& cx) {
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_await SimplestCoro(cx);
co_return co_await SimplestCoro(cx);
}
This is despite both functions requiring the same amount of temporary storage.
Discussion on this topic continues in this LLVM issue regarding the missing lifetime markers on temporaries.
A similar issue in Rust was solved thanks to tmandry’s efforts culminating in this PR.
Passing values in and out of coroutines#
Once a coroutine function has been invoked, it can only be advanced (and completed) via a series of calls to resume. Annoyingly, resume does not accept any arguments and returns void, so it does not offer a built-in way to pass arguments into our coroutine function, nor does it give us a way to return values from our coroutine function.
pw::async2::Coro solves this by storing a pointer to input and output storage inside the promise_type before invoking resume:
// Create the arguments (and output storage) for the coroutine.
internal::InOut<T> in_out;
internal::OptionalOrDefault<T> return_value;
in_out.input_cx = &cx;
in_out.output = &return_value;
promise_handle_.promise().in_out_ = &in_out;

// Resume the coroutine, triggering `Awaitable::await_resume()` and the
// returning of the resulting value from `co_await`.
promise_handle_.resume();
if (!promise_handle_.done()) {
  return Pending();
}

// Destroy the coroutine state: it has completed, and further calls to
// `resume` would result in undefined behavior.
promise_handle_.Release();

// When the coroutine completed in `resume()` above, it stored its
// `co_return` value into `return_value`. This retrieves that value.
return return_value;
This means that every coroutine frame must additionally include space for this input/output pointer. This wasted overhead seems fairly easy to eliminate if C++ evolved to allow promise_type to have associated resume_arg_type and return_arg_type values.
Avoiding dynamic allocation for statically known coroutines#
Although the steps described in this blog will allow you to avoid heap allocation, they still rely on dynamic allocation via a pw::Allocator. This means that our application can potentially fail at runtime by running out of memory for our coroutines.
For code which only instantiates a single known coroutine or fixed set of known coroutines defined within the same translation unit, this is an unnecessary cost. Dynamic allocation creates runtime inefficiency, developer uncertainty, and risk: it’s no longer possible to statically ensure that your program will be able to run on-device.
It would be fantastic if future C++ standards would make it possible to access the coroutine frame size of specific, known coroutines so that space for them could be statically allocated.
Heap allocation elision optimization (HALO)
Heap allocation elision optimization can omit the calls to new and delete entirely in some cases. However, this relies on the inlining of several independent functions into the top-level scope which encompasses the full lifetime of the coroutine. This isn’t possible in the vast majority of the relevant embedded use cases, where it is common to store a coroutine as part of a long-lived (often static) object.
Furthermore, relying on such an optimization would be quite fragile, as the compiler’s choice to inline (or not) is dependent upon a variety of heuristics outside the programmer’s control.
Note, too, that elision requires care when using custom allocator implementations: the allocation and deallocation functions must be tagged with the appropriate attributes to tell the compiler that they are safe to optimize away.
Try out Pigweed’s coroutine API now#
To experiment with Pigweed’s pw::async2::Coro API, you can get started by cloning our Quickstart repo for a bare-bones example or trying out the Sense tutorial for a full tour of a Pigweed product codebase.