Size optimizations#
This page contains recommendations for optimizing the size of embedded software including its memory and code footprints.
These recommendations are subject to change as the C++ standard and compilers evolve, and as the authors continue to gain more knowledge and experience in this area. If you disagree with the recommendations, please discuss them with the Pigweed team, as we’re always looking to improve the guide or correct any inaccuracies.
Compile Time Constant Expressions#
The use of constexpr, and soon C++20’s consteval, lets you evaluate functions and variables at compile time rather than only at run time. This often results not only in smaller code size but also in more efficient, faster execution.
We highly encourage using this aspect of C++; however, there is one caveat: be careful about marking functions constexpr in APIs which cannot easily be changed later, unless you can prove that for all time and all platforms the computation can actually be done at compile time. This is because there is no “mutable” escape hatch for constexpr.
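For example, a lookup table can be computed entirely at compile time and placed in read-only memory with no run-time initialization code. This is a minimal sketch using illustrative names rather than any particular Pigweed API:
#include <array>
#include <cstdint>

// One entry of a (simplified) CRC32 table, usable in constant expressions
// because it is constexpr.
constexpr uint32_t Crc32TableEntry(uint32_t value) {
  for (int i = 0; i < 8; ++i) {
    value = (value & 1) ? (value >> 1) ^ 0xEDB88320u : value >> 1;
  }
  return value;
}

// The whole table is evaluated at compile time; no code runs at startup and
// the table can live in read-only memory.
constexpr std::array<uint32_t, 256> kCrc32Table = [] {
  std::array<uint32_t, 256> table{};
  for (uint32_t i = 0; i < table.size(); ++i) {
    table[i] = Crc32TableEntry(i);
  }
  return table;
}();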
See the Embedded C++ Guide for more detail.
Templates#
The compiler implements templates by generating a separate version of the function for each set of types it is instantiated with. This can increase code size significantly.
Be careful when instantiating non-trivial template functions with multiple types.
Consider splitting templated interfaces into multiple layers so that more of the implementation can be shared between different instantiations. A more advanced form is to share common logic internally by instantiating with a default sentinel template argument value, such as pw::Vector’s size_t kMaxSize = vector_impl::kGeneric or pw::span’s size_t Extent = dynamic_extent.
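As a minimal sketch of this layering pattern (the names here are illustrative, not Pigweed’s actual implementation), the size-generic base class holds the shared logic while the thin, size-specific derived class only supplies storage:
#include <cstddef>

// Size-generic layer: the non-trivial logic lives here and is shared by every
// RingBuffer<T, kCapacity> instantiation with the same element type T.
template <typename T>
class RingBufferBase {
 public:
  size_t capacity() const { return capacity_; }
  bool full() const { return count_ == capacity_; }

 protected:
  RingBufferBase(T* buffer, size_t capacity)
      : buffer_(buffer), capacity_(capacity) {}

  T* buffer_;
  size_t capacity_;
  size_t count_ = 0;
};

// Sized layer: only the storage depends on kCapacity, so very little code is
// generated per instantiation.
template <typename T, size_t kCapacity>
class RingBuffer : public RingBufferBase<T> {
 public:
  RingBuffer() : RingBufferBase<T>(storage_, kCapacity) {}

 private:
  T storage_[kCapacity];
};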
Virtual Functions#
Virtual functions provide for runtime polymorphism. Unless runtime polymorphism is required, virtual functions should be avoided. Virtual functions require a virtual table and a pointer to it in each instance, all of which increases RAM usage and requires extra instructions at each call site. Virtual functions can also inhibit compiler optimizations, since the compiler may not be able to tell which functions will actually be invoked. This can prevent linker garbage collection, resulting in unused functions being linked into a binary.
When runtime polymorphism is required, virtual functions should be considered. C alternatives, such as a struct of function pointers, could be used instead, but these approaches may offer no performance advantage while sacrificing flexibility and ease of use.
Only use virtual functions when runtime polymorphism is needed. Lastly, try to avoid templated virtual interfaces, which can compound the cost by instantiating many virtual tables.
Devirtualization#
When you do use virtual functions, try to keep devirtualization in mind. You can make it easier for the compiler and linker by declaring classes final to improve the odds. This can help significantly depending on your toolchain.
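For example (with illustrative names), marking a concrete implementation final gives the compiler license to turn virtual calls through that type into direct, inlinable calls:
#include <cstddef>

class Stream {
 public:
  virtual ~Stream() = default;
  virtual size_t Write(const void* data, size_t size) = 0;
};

// Because UartStream is final, calls made through a UartStream& or
// UartStream* can be devirtualized into direct calls to UartStream::Write.
class UartStream final : public Stream {
 public:
  size_t Write(const void* /*data*/, size_t size) override {
    // ... write the bytes to the UART peripheral (omitted) ...
    return size;
  }
};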
If you’re interested in more details, this is an interesting deep dive.
Initialization, Constructors, Finalizers, and Destructors#
Constructors#
Where possible, consider making your constructors constexpr to reduce their cost. This also makes global instances eligible for placement in the .data section, or in .bss if they are all zeros.
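A minimal sketch with illustrative names: a constexpr constructor lets a global instance be constant-initialized, so its bytes are emitted directly into the image instead of being produced by startup code.
#include <cstdint>

class Gpio {
 public:
  explicit constexpr Gpio(uint32_t pin) : pin_(pin), enabled_(false) {}

 private:
  uint32_t pin_;
  bool enabled_;
};

// Constant-initialized: no static constructor is emitted and the instance is
// placed in .data (or .bss if its value is all zeros). In C++20, constinit
// can be used to enforce this at compile time.
Gpio status_led(13);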
Static Destructors And Finalizers#
For many embedded projects, cleaning up after the program is not a requirement, meaning that the exit functions, including any finalizers registered through atexit and at_quick_exit as well as static destructors, can all be removed to reduce the size.
The exact mechanics for disabling static destructors depend on your toolchain. See the Ignored Finalizer and Destructor Registration section below for further details regarding disabling the registration of functions to be run at exit via atexit and at_quick_exit.
Clang#
With modern versions of Clang you can simply use -fno-c++-static-destructors and you are done.
GCC with newlib-nano#
With GCC this is more complicated. For example, with GCC for ARM Cortex M devices using newlib-nano you are forced to tackle the problem in two stages.
First, there are the destructors for the static global objects. These can be placed in the .fini_array and .fini input sections through the use of the -fno-use-cxa-atexit GCC flag, assuming newlib-nano was configured with HAVE_INITFINI_ARRAY_SUPPORT. The two input sections can then be explicitly discarded in the linker script through the use of the special /DISCARD/ output section:
/DISCARD/ : {
  /* The finalizers are never invoked when the target shuts down and ergo
   * can be discarded. These include C++ global static destructors and C
   * designated finalizers. */
  *(.fini_array);
  *(.fini);
}
Second, there are the destructors for the scoped static objects, frequently referred to as Meyer’s Singletons. With the Itanium ABI these use __cxa_atexit to register destruction on the fly. However, if -fno-use-cxa-atexit is used with GCC and newlib-nano, these will appear as __tcf_ prefixed symbols, for example __tcf_0.
There’s an interesting proposal (P1247R0) to add [[no_destroy]] attributes to C++ which would be tempting to use here. Alas, this is not an option yet. As mentioned in the proposal, one way to remove the destructors from these scoped statics is to wrap them in a templated wrapper which uses placement new.
#include <type_traits>
#include <utility>

template <class T>
class NoDestroy {
 public:
  template <class... Ts>
  NoDestroy(Ts&&... ts) {
    new (&static_) T(std::forward<Ts>(ts)...);
  }

  T& get() { return reinterpret_cast<T&>(static_); }

 private:
  std::aligned_storage_t<sizeof(T), alignof(T)> static_;
};
This can then be used as follows to instantiate scoped statics where the destructor will never be invoked and ergo will not be linked in.
Foo& GetFoo() {
  static NoDestroy<Foo> foo(foo_args);
  return foo.get();
}
Strings#
Tokenization#
Instead of directly using strings and printf, consider using pw_tokenizer to replace strings and printf-style formatted strings with binary tokens during compilation. This can reduce the code size, memory usage, I/O traffic, and even CPU utilization by replacing snprintf calls with simple tokenization code.
Be careful when using string arguments with tokenization as these still result in a string in your binary which is appended to your token at run time.
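As a rough sketch of the idea (see the pw_tokenizer documentation for the exact macros and transports; SendTokenToHost below is a hypothetical hook, not part of Pigweed):
#include <cstdint>

#include "pw_tokenizer/tokenize.h"

// Hypothetical transport hook; not part of pw_tokenizer.
void SendTokenToHost(uint32_t token, int argument);

void ReportBatteryLevel(int percent) {
  // The string literal is replaced by a fixed 32-bit token at compile time;
  // the text itself lives in a host-side token database, not in the binary.
  constexpr uint32_t kToken = PW_TOKENIZE_STRING("Battery level changed");
  SendTokenToHost(kToken, percent);
}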
String Formatting#
The formatted output family of printf functions in <cstdio> is quite expensive from a code size point of view, and these functions often rely on malloc. Instead, where tokenization cannot be used, consider using pw_string’s utilities. Removing all printf functions often saves more than 5KiB of code size on ARM Cortex M devices using newlib-nano.
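For example, pw_string’s StringBuilder-style utilities format into a fixed-size buffer without malloc or <cstdio>; a rough sketch (consult the pw_string docs for the exact API):
#include "pw_string/string_builder.h"

void BuildStatusMessage(int battery_percent) {
  // StringBuffer provides a fixed-size internal buffer; nothing is heap
  // allocated and no printf-family machinery is pulled in.
  pw::StringBuffer<64> sb;
  sb << "Battery: " << battery_percent << "%";
  // sb.c_str() can then be handed to the transport or display layer.
}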
Logging & Asserting#
Using tokenized backends for logging and asserting, such as pw_log_tokenized coupled with pw_assert_log, can drastically reduce these costs. However, even with this approach there remains a per-call-site cost which can add up due to arguments and the included metadata.
Try to avoid string arguments and reduce unnecessary extra arguments where possible. Also consider adjusting log levels to compile out debug or even info logs as code stabilizes and matures.
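For example (a sketch; check the pw_log documentation for the exact configuration macros), a source file can raise its log level so that debug log call sites compile out entirely:
// Defining the level before including the header compiles out PW_LOG_DEBUG
// call sites in this file, removing their tokens/strings and argument setup.
#define PW_LOG_MODULE_NAME "MOTOR"
#define PW_LOG_LEVEL PW_LOG_LEVEL_INFO

#include "pw_log/log.h"

void SpinUp() {
  PW_LOG_DEBUG("Tuning loop gains");  // Compiled out at this level.
  PW_LOG_INFO("Motor spinning up");
}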
Future Plans#
Going forward, Pigweed is evaluating extra configuration options, such as dropping log arguments for certain log levels and modules, to give users finer-grained control in trading off diagnostic value against size cost.
Threading and Synchronization Cost#
Lighter Weight Signaling Primitives#
Consider using pw::sync::ThreadNotification instead of semaphores, as they can be implemented using more efficient RTOS-specific signaling primitives. For example, on FreeRTOS they can be backed by direct task notifications, which are more than 10x smaller than semaphores while also being faster.
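A minimal sketch of the usage (see pw_sync for the full API and its interrupt-safety notes):
#include "pw_sync/thread_notification.h"

pw::sync::ThreadNotification data_ready;

void ConsumerThread() {
  data_ready.acquire();  // Blocks until another context calls release().
  // ... handle the event ...
}

void Producer() {
  data_ready.release();  // Wakes the single blocked consumer thread.
}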
Threads and their stack sizes#
Although synchronous APIs are incredibly portable and often easier to reason about, it is easy to forget the large stack cost that this design paradigm comes with. We highly recommend watermarking your stacks to reduce wasted memory.
Our snapshot integration for RTOSes such as pw_thread_freertos and pw_thread_embos comes with built-in support for reporting stack watermarks for threads if enabled in the kernel.
In addition, consider using asynchronous design patterns such as Active Objects which can use pw_work_queue or similar asynchronous dispatch work queues to effectively permit the sharing of stack allocations.
Buffer Sizing#
We’d be remiss not to mention the sizing of the various buffers that may exist in your application. You could consider watermarking them with pw_metric. You may also be able to adjust their servicing interval and priority, but do not forget to take ingress burst sizes and scheduling jitter into account.
Standard C and C++ libraries#
Toolchains are typically distributed with their preferred standard C library and standard C++ library of choice for the target platform.
Although you do not always have a choice in what standard C library and what standard C++ library is used or even how it’s compiled, stay vigilant for common sources of bloat.
Assert#
The standard C library provides the assert function or macro, which may be used internally even if your application does not invoke it directly. Although asserts can be disabled through NDEBUG, there typically is not a portable way of replacing the assert(condition) implementation without configuring and recompiling your standard C library.
However, you can consider replacing the implementation at link time with a cheaper implementation. For example newlib-nano, which comes with the GNU Arm Embedded Toolchain, often has an expensive __assert_func implementation which uses fiprintf to print to stderr before invoking abort(). This can be replaced with a simple PW_CRASH invocation, which can save several kilobytes in case fiprintf isn’t used elsewhere.
One option to remove this bloat is to use --wrap at link time to replace these implementations. As an example, in GN you could replace it with the following BUILD.gn file:
import("//build_overrides/pigweed.gni")
import("$dir_pw_build/target_types.gni")
# Wraps the function called by newlib's implementation of assert from assert.h.
#
# When using this, we suggest injecting :wrapped_newlib_assert via
# pw_build_LINK_DEPS.
config("wrap_newlib_assert") {
ldflags = [ "-Wl,--wrap=__assert_func" ]
}
# Implements the function called by newlib's implementation of assert from
# assert.h, which invokes __assert_func unless NDEBUG is defined.
pw_source_set("wrapped_newlib_assert") {
sources = [ "wrapped_newlib_assert.cc" ]
deps = [
"$dir_pw_assert:check",
"$dir_pw_preprocessor",
]
}
And a wrapped_newlib_assert.cc source file implementing the wrapped assert function:
#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"
// This is defined by <cassert>
extern "C" PW_NO_RETURN void __wrap___assert_func(const char*,
int,
const char*,
const char*) {
PW_CRASH("libc assert() failure");
}
Ignored Finalizer and Destructor Registration#
Even if no cleanup is done during shutdown for your target, shutdown functions such as atexit, at_quick_exit, and __cxa_atexit can sometimes not be linked out. This may be due to vendor code or perhaps the use of scoped statics, also known as Meyer’s Singletons.
The registration of these destructors and finalizers may include locks, malloc, and more depending on your standard C library and its configuration.
One option to remove this bloat is to use --wrap at link time to replace these implementations with ones which do nothing. As an example, in GN you could replace them with the following BUILD.gn file:
import("//build_overrides/pigweed.gni")
import("$dir_pw_build/target_types.gni")
config("wrap_atexit") {
ldflags = [
"-Wl,--wrap=atexit",
"-Wl,--wrap=at_quick_exit",
"-Wl,--wrap=__cxa_atexit",
]
}
# Implements atexit, at_quick_exit, and __cxa_atexit from stdlib.h with noop
# versions for targets which do not clean up during exit and quick_exit.
#
# This removes any dependencies which may exist in your existing libc.
# Although this removes the ability for things such as Meyer's Singletons,
# i.e. non-global statics, to register a destruction function, it does not
# permit them to be garbage collected by the linker.
pw_source_set("wrapped_noop_atexit") {
sources = [ "wrapped_noop_atexit.cc" ]
}
And a wrapped_noop_atexit.cc source file implementing the noop functions:
// These two are defined by <cstdlib>.
extern "C" int __wrap_atexit(void (*)(void)) { return 0; }
extern "C" int __wrap_at_quick_exit(void (*)(void)) { return 0; }
// This function is part of the Itanium C++ ABI, there is no header which
// provides this.
extern "C" int __wrap___cxa_atexit(void (*)(void*), void*, void*) { return 0; }
Unexpected Bloat in Disabled STL Exceptions#
The GCC manual recommends using -fno-exceptions along with -fno-unwind-tables to disable exceptions and any associated overhead. This should replace all throw statements with calls to abort().
However, what we’ve noticed with GCC and libstdc++ is that there is a risk that the STL will still throw exceptions when the application is compiled with -fno-exceptions, and there is no way for you to catch them. In theory this is not unsafe, because the unhandled exception will invoke abort() via std::terminate(). This can occur because libraries such as libstdc++.a may not have been compiled with -fno-exceptions even though your application is linked against them.
See this for more information.
Unfortunately there can be significant overhead surrounding these throw call sites in the std::__throw_* helper functions. Implementations such as std::__throw_out_of_range_fmt(const char*, ...), with their snprintf and therefore malloc dependencies, can very quickly add up to many kilobytes of unnecessary overhead.
One option to remove this bloat, while also making sure that the exceptions actually result in an effective abort(), is to use --wrap at link time to replace these implementations with ones which simply call PW_CRASH. As an example, in GN you could replace them with the following BUILD.gn file; note that the mangled names must be used:
import("//build_overrides/pigweed.gni")
import("$dir_pw_build/target_types.gni")
# Wraps the std::__throw_* functions called by GNU ISO C++ Library regardless
# of whether "-fno-exceptions" is specified.
#
# When using this, we suggest injecting :wrapped_libstdc++_functexcept via
# pw_build_LINK_DEPS.
config("wrap_libstdc++_functexcept") {
ldflags = [
"-Wl,--wrap=_ZSt21__throw_bad_exceptionv",
"-Wl,--wrap=_ZSt17__throw_bad_allocv",
"-Wl,--wrap=_ZSt16__throw_bad_castv",
"-Wl,--wrap=_ZSt18__throw_bad_typeidv",
"-Wl,--wrap=_ZSt19__throw_logic_errorPKc",
"-Wl,--wrap=_ZSt20__throw_domain_errorPKc",
"-Wl,--wrap=_ZSt24__throw_invalid_argumentPKc",
"-Wl,--wrap=_ZSt20__throw_length_errorPKc",
"-Wl,--wrap=_ZSt20__throw_out_of_rangePKc",
"-Wl,--wrap=_ZSt24__throw_out_of_range_fmtPKcz",
"-Wl,--wrap=_ZSt21__throw_runtime_errorPKc",
"-Wl,--wrap=_ZSt19__throw_range_errorPKc",
"-Wl,--wrap=_ZSt22__throw_overflow_errorPKc",
"-Wl,--wrap=_ZSt23__throw_underflow_errorPKc",
"-Wl,--wrap=_ZSt19__throw_ios_failurePKc",
"-Wl,--wrap=_ZSt19__throw_ios_failurePKci",
"-Wl,--wrap=_ZSt20__throw_system_errori",
"-Wl,--wrap=_ZSt20__throw_future_errori",
"-Wl,--wrap=_ZSt25__throw_bad_function_callv",
]
}
# Implements the std::__throw_* functions called by GNU ISO C++ Library
# regardless of whether "-fno-exceptions" is specified with PW_CRASH.
pw_source_set("wrapped_libstdc++_functexcept") {
sources = [ "wrapped_libstdc++_functexcept.cc" ]
deps = [
"$dir_pw_assert:check",
"$dir_pw_preprocessor",
]
}
And a wrapped_libstdc++_functexcept.cc source file implementing each wrapped and mangled std::__throw_* function:
#include "pw_assert/check.h"
#include "pw_preprocessor/compiler.h"
// These are all wrapped implementations of the throw functions provided by
// libstdc++'s bits/functexcept.h which are not needed when "-fno-exceptions"
// is used.
// std::__throw_bad_exception(void)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_bad_exceptionv() {
PW_CRASH("std::throw_bad_exception");
}
// std::__throw_bad_alloc(void)
extern "C" PW_NO_RETURN void __wrap__ZSt17__throw_bad_allocv() {
PW_CRASH("std::throw_bad_alloc");
}
// std::__throw_bad_cast(void)
extern "C" PW_NO_RETURN void __wrap__ZSt16__throw_bad_castv() {
PW_CRASH("std::throw_bad_cast");
}
// std::__throw_bad_typeid(void)
extern "C" PW_NO_RETURN void __wrap__ZSt18__throw_bad_typeidv() {
PW_CRASH("std::throw_bad_typeid");
}
// std::__throw_logic_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_logic_errorPKc(const char*) {
PW_CRASH("std::throw_logic_error");
}
// std::__throw_domain_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_domain_errorPKc(const char*) {
PW_CRASH("std::throw_domain_error");
}
// std::__throw_invalid_argument(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_invalid_argumentPKc(
const char*) {
PW_CRASH("std::throw_invalid_argument");
}
// std::__throw_length_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_length_errorPKc(const char*) {
PW_CRASH("std::throw_length_error");
}
// std::__throw_out_of_range(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_out_of_rangePKc(const char*) {
PW_CRASH("std::throw_out_of_range");
}
// std::__throw_out_of_range_fmt(const char*, ...)
extern "C" PW_NO_RETURN void __wrap__ZSt24__throw_out_of_range_fmtPKcz(
const char*, ...) {
PW_CRASH("std::throw_out_of_range");
}
// std::__throw_runtime_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt21__throw_runtime_errorPKc(
const char*) {
PW_CRASH("std::throw_runtime_error");
}
// std::__throw_range_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_range_errorPKc(const char*) {
PW_CRASH("std::throw_range_error");
}
// std::__throw_overflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt22__throw_overflow_errorPKc(
const char*) {
PW_CRASH("std::throw_overflow_error");
}
// std::__throw_underflow_error(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt23__throw_underflow_errorPKc(
const char*) {
PW_CRASH("std::throw_underflow_error");
}
// std::__throw_ios_failure(const char*)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKc(const char*) {
PW_CRASH("std::throw_ios_failure");
}
// std::__throw_ios_failure(const char*, int)
extern "C" PW_NO_RETURN void __wrap__ZSt19__throw_ios_failurePKci(const char*,
int) {
PW_CRASH("std::throw_ios_failure");
}
// std::__throw_system_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_system_errori(int) {
PW_CRASH("std::throw_system_error");
}
// std::__throw_future_error(int)
extern "C" PW_NO_RETURN void __wrap__ZSt20__throw_future_errori(int) {
PW_CRASH("std::throw_future_error");
}
// std::__throw_bad_function_call(void)
extern "C" PW_NO_RETURN void __wrap__ZSt25__throw_bad_function_callv() {
PW_CRASH("std::throw_bad_function_call");
}
Compiler and Linker Optimizations#
Compiler Optimization Options#
Don’t forget to configure your compiler to optimize for size if needed. With Clang this is -Oz and with GCC this can be done via -Os. The GN toolchains provided through pw_toolchain which are optimized for size are suffixed with *_size_optimized.
Garbage collect function and data sections#
By default the linker will place all functions in an object within the same linker “section” (e.g. .text). With Clang and GCC you can use -ffunction-sections and -fdata-sections to use a unique “section” for each object (e.g. .text.do_foo_function). This permits you to pass --gc-sections to the linker to cull any unused sections which were not referenced.
To see what sections were garbage collected, you can pass --print-gc-sections to the linker so it prints out what was removed.
The GN toolchains provided through pw_toolchain are configured to do this by default.
Function Inlining#
Don’t forget to expose trivial functions such as member accessors as inline definitions in the header. The compiler and linker can make the trade-off on whether the function should be actually inlined or not based on your optimization settings, however this at least gives it the option. Note that LTO can inline functions which are not defined in headers.
We stand by the Google style guide’s recommendation to consider this for simple functions which are 10 lines or fewer.
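For example, a trivial accessor defined in the header gives every translation unit the option to inline it rather than emit an out-of-line call:
#include <cstdint>

class Motor {
 public:
  // Defined in the header, so the compiler can inline this at each call site
  // instead of emitting a call to a one-instruction out-of-line function.
  uint32_t rpm() const { return rpm_; }

 private:
  uint32_t rpm_ = 0;
};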
Link Time Optimization (LTO)#
Summary: LTO can decrease your binary size, at a cost: LTO makes debugging harder, interacts poorly with linker scripts, and makes crash reports less informative. We advise enabling LTO only when absolutely necessary.
Link time optimization (LTO) moves some optimizations from the individual compile steps to the final link step, to enable optimizing across translation unit boundaries.
LTO can both increase performance and reduce binary size for embedded projects. This appears to be a strict improvement, and one might think enabling LTO at all times is the best approach. However, this is not the case; in practice, LTO is a trade-off.
LTO benefits
Reduces binary size - When compiling with size-shrinking flags like -Oz, some function call overhead can be eliminated, and code paths might be eliminated by the optimizer after inlining. This can include critical abstraction removal like devirtualization.
Improves performance - When code is inlined, the optimizer can better reduce the number of instructions. When code is smaller, the instruction cache has a better hit ratio, leading to better performance. In some cases, entire function calls are eliminated.
LTO costs
LTO interacts poorly with linker scripts - Production embedded projects often have complicated linker scripts to control the physical layout of code and data on the device. For example, you may want to put performance-critical audio codec functions into the fast tightly coupled memory (TCM) region. However, LTO can interact with linker script requirements in strange ways, such as inlining code that was manually placed in one region into functions that live in another region, leading to hard-to-understand bugs.
Debugging LTO binaries is harder - LTO increases the differences between the machine code and the source code. This makes stepping through source code in a debugger confusing, since the instruction pointer can hop around in unexpected ways.
Crash reports for LTO binaries can be misleading - Just as with debugging, LTO’d binaries can produce confusing stacks in crash reports.
LTO significantly increases build times - The compilation model is different when LTO is enabled: individual translation unit compilations (.cc -> .o) now produce LLVM or GCC IR instead of native machine code, and machine code is only generated at the link phase. This makes the final link step take significantly longer. Since any source change results in a link step, developer velocity is reduced due to the slow build time.
How to enable LTO#
On GCC and Clang, LTO is enabled by passing -flto to both the compiler and the linker. On GCC, -fdevirtualize-at-ltrans enables more aggressive devirtualization.
Our recommendation#
Disable LTO unless absolutely necessary; e.g. due to lack of space.
When enabling LTO, carefully and thoroughly test the resulting binary.
Check that crash reports are still useful under LTO for your product.
Disabling Scoped Static Initialization Locks#
C++11 requires that scoped static objects are initialized in a thread-safe manner. This also means that scoped statics, i.e. Meyer’s Singletons, are expected to be thread-safe to initialize. Unfortunately this is rarely the case on embedded targets. For example, with GCC on an ARM Cortex M device, if you test for this you will discover that instead the program crashes, as reentrant initialization is detected through the use of guard variables.
With GCC and Clang, -fno-threadsafe-statics can be used to remove the global lock, which often does not work for embedded targets. Note that this leaves the guard variables in place, which ensure that reentrant initialization continues to crash.
Be careful when using this option in case you are relying on threadsafe initialization of statics and the global locks were functional for your target.
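To illustrate what is affected (UartDriver here is a hypothetical class), consider a scoped static, i.e. a Meyer’s Singleton:
class UartDriver {
 public:
  explicit UartDriver(unsigned base_address) : base_address_(base_address) {}

 private:
  unsigned base_address_;
};

UartDriver& GetUartDriver() {
  // Initializing this local static normally emits calls to
  // __cxa_guard_acquire/__cxa_guard_release around the constructor. With
  // -fno-threadsafe-statics those lock calls are removed, but the guard
  // variable remains, so reentrant initialization is still detected.
  static UartDriver driver(0x40000000);
  return driver;
}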
Triaging Unexpectedly Linked In Functions#
Lastly, as a tip, if you cannot figure out why a function is being linked in, you can consider:
Using --wrap with the linker to remove the implementation, resulting in a link failure which typically calls out which calling function can no longer be linked.
With GCC, using -fcallgraph-info to visualize or otherwise inspect the callgraph to figure out who is calling what.
Sometimes symbolizing the address can resolve what a function is for. For example, if you are using newlib-nano along with -fno-use-cxa-atexit, scoped static destructors are prefixed with __tcf_. To figure out which object these destructor functions are associated with, you can use llvm-symbolizer or addr2line, and these will often print out the related object’s name.
Sorting input sections by alignment#
Linker scripts often contain input section wildcard patterns to specify which input sections should be placed in each output section. For example, say a linker script contains a sections command like the following:
.text : { *(.init*) *(.text*) }
By default, the GCC and Clang linkers will place symbols matched by each wildcard pattern in the order they are seen at link-time. The linker will insert padding bytes as necessary to satisfy the alignment requirements of each symbol.
The GCC and Clang linkers allow one to first sort the matched symbols for each wildcard pattern by alignment with the SORT_BY_ALIGNMENT keyword, which can reduce the number of necessary padding bytes and save memory. This can be used to enable alignment sorting on a per-pattern basis like so:
.text : { *(SORT_BY_ALIGNMENT(.init*)) *(SORT_BY_ALIGNMENT(.text*)) }
This keyword can be applied globally to all wildcard matches in your linker script by passing the --sort-section=alignment option to the linker.
See the ld manual for more information.