0105: Nested Tokens and Tokenized Log Arguments#
Status: Accepted
Proposal Date: 2023-07-10
CL: pwrev/154190
Author: Gwyneth Chen
Facilitator: Wyatt Hepler
Summary#
This SEED describes a number of extensions to the pw_tokenizer and pw_log_tokenized modules to improve support for nesting tokens and to add facilities for tokenizing log arguments such as strings and enums. This SEED primarily addresses C/C++ tokenization and Python/C++ detokenization.
Motivation#
Currently, pw_tokenizer
and pw_log_tokenized
enable devices with limited
memory to store long log format strings as hashed 32-bit tokens. When logs are
moved off-device, host tooling can recover the full logs using token databases
that were created when building the device image. However, logs may still have
runtime string arguments that are stored and transferred 1:1 without additional
encoding. This SEED aims to extend tokenization to these arguments to further
reduce the weight of logging for embedded applications.
The proposed changes affect both the tokenization module itself and the logging facilities built on top of tokenization.
Proposal#
Logging enums such as pw::Status
is one common special case where
tokenization is particularly appropriate: enum values are conceptually
already tokens mapping to their names, assuming no duplicate values. Logging
enums frequently entails creating functions and string names that occupy space
exclusively for logging purposes, which this proposal seeks to mitigate.
Here, pw::Status::NotFound()
is presented as an illustrative example of
the several transformations that strings undergo during tokenization and
detokenization, further complicated in the proposed design by nested tokens.
For this example, the token database entries, given as (token, string, domain) triples, are:
16170adf, "Status: ${pw::Status}#%08x", ""
5, "PW_STATUS_NOT_FOUND", "pw::Status"
After the outer format string is detokenized, the log reads "Status: ${pw::Status}#00000005"; the nested token is then resolved against the pw::Status domain to yield the final "Status: PW_STATUS_NOT_FOUND".
Compared to log tokenization without nesting, string literals in token
database entries may not be identical to what is typed in source code due
to the use of macros and preprocessor string concatenation. The
detokenizer also takes an additional step to recursively detokenize any
nested tokens. In exchange for this added complexity, nested enum tokenization
allows us to gain the readability of logging value names with zero additional
runtime space or performance cost compared to logging the integral values
directly with pw_log_tokenized
.
Note
Without nested enum token support, users can select either readability or reduced binary and transmission size, but not easily both.
Tokenization (C/C++)#
The pw_log_tokenized
module exposes a set of macros for creating and
formatting nested tokens. Within format strings in the source code, tokens
are specified using function-like PRI-style macros. These can be used to
encode static information like the token domain or a numeric base encoding
and are macro-expanded to string literals that are concatenated with the
rest of the format string during preprocessing. Since pw_log
generally
uses printf syntax, only bases 8, 10, and 16 are supported for integer token
arguments via %[odiuxX]
.
The provided macros enforce the token specifier syntax and keep the argument
types in sync when switching between other pw_log
backends like
pw_log_basic
. These macros for basic usage are as follows:
PW_LOG_TOKEN and PW_LOG_TOKEN_EXPR are used to tokenize string args.
PW_LOG_TOKEN_FMT is used inside the format string to specify a token arg.
PW_LOG_TOKEN_TYPE is used if the type of a tokenized arg needs to be referenced, e.g. as a ToString function return type.
#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"
// token with default options base-16 and empty domain
// token database literal: "The sun will come out $#%08x!"
PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!", PW_LOG_TOKEN_EXPR("tomorrow"))
// after detokenization: "The sun will come out tomorrow!"
Additional macros are also provided specifically for enum handling. The
TOKENIZE_ENUM
macro creates ELF token database entries for each enum
value with the specified token domain to prevent token collision between
multiple tokenized enums. This macro is kept separate from the enum
definition to allow things like tokenizing a preexisting enum defined in an
external dependency.
// enums
namespace foo {
enum class Color { kRed, kGreen, kBlue };
// syntax TBD
TOKENIZE_ENUM(
foo::Color,
kRed,
kGreen,
kBlue
)
} // namespace foo
void LogColor(foo::Color color) {
// token database literal:
// "Color: [${foo::Color}10#%010d]"
PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color)
// after detokenization:
// e.g. "Color: kRed"
}
Nested Base64 tokens
PW_LOG_TOKEN_FMT
can accept 64 as the base encoding for an argument, in
which case the argument should be a pre-encoded Base64 string argument
(e.g. QAzF39==
). However, this should be avoided when possible to
maximize space savings. Fully-formatted Base64 including the token prefix
may also be logged with %s
as before.
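For illustration, a log statement with a pre-encoded Base64 argument might look like the following; this is a sketch that assumes PW_LOG_TOKEN_FMT(64) expands to the $ prefix plus a %s specifier, and it reuses the example data above:
// token database literal (assumed expansion): "Received response $%s"
PW_LOG("Received response " PW_LOG_TOKEN_FMT(64), "QAzF39==")
// the nested Base64 token is expanded recursively during detokenization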
Detokenization (Python)#
Detokenizer.detokenize
in Python (Detokenizer::Detokenize
in C++)
will automatically recursively detokenize tokens of all known formats rather
than requiring a separate call to detokenize_base64
or similar.
To support detokenizing domain-specific tokens, token databases support multiple
domains, and database.py create
will build a database with tokens from all
domains by default. Specifying a domain during database creation will cause
that domain to be treated as the default.
When detokenization fails, tokens appear as-is in logs. If the detokenizer has
the show_errors
option set to True
, error messages may be printed
inline following the raw token.
Tokens#
Many details described here are provided via the PW_LOG_TOKEN_FMT
macro, so
users should typically not be manually formatting tokens. However, if
detokenization fails for any reason, tokens will appear with the following
format in the final logs and should be easily recognizable.
Nested tokens have the following structure in partially detokenized logs:
$[{DOMAIN}][BASE#]TOKEN
The $
is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default $
character is strongly recommended.
Field | Description
---|---
{DOMAIN} | Specifies the token domain. If this option is omitted, the default (empty) domain is assumed.
BASE# | Defines the numeric base encoding of the token. Accepted values are 8, 10, 16, and 64. If only the hash symbol is given without an explicit base, base 16 is assumed. This option may be expanded to support other bases in the future.
TOKEN | The numeric representation of the token in the given base encoding. All encodings except Base64 are left-padded with zeroes to the maximum width of a 32-bit integer in the given base. Base64 data may additionally encode string arguments for the detokenized token, and therefore does not have a maximum width. This is automatically handled by PW_LOG_TOKEN_FMT.
When used in conjunction with pw_log_tokenized
, the token prefix (including
any domain and base specifications) is tokenized as part of the log format
string and therefore incurs zero additional memory or transmission cost over
that of the original format string. Over the wire, tokens in bases 8, 10, and
16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
tokens continue to be encoded as strings.
Warning
Tokens do not have a terminating character in general, which is why we require them to be formatted with fixed width. Otherwise, following them immediately with alphanumeric characters valid in their base encoding will cause detokenization errors.
Recognizing raw nested tokens in strings
When a string is fully detokenized, there should no longer be any indication of tokenization in the final result, e.g. detokenized logs should read the same as plain string logs. However, if nested tokens cannot be detokenized for any reason, they will appear in their raw form as below:
// Base64 token with no arguments and empty domain
$QA19pfEQ
// Base-10 token
$10#0086025943
// Base-16 token with specified domain
${foo_namespace::MyEnum}#0000001A
// Base64 token with specified domain
${bar_namespace::MyEnum}QAQQQQ==
Problem investigation#
Complex embedded device projects are perpetually seeking more RAM. Even a handful of longer descriptive string arguments can occupy hundreds of bytes that serve logging purposes exclusively, without any impact on device functionality.
One of the most common potential use cases is logging enum values. Inspection of one project revealed that enums accounted for some 90% of the string log arguments. We have encountered instances where, to save space, developers have avoided logging descriptive names in favor of raw enum values, forcing readers of logs to look up or memorize the meaning of each number. As with log format strings, the set of possible string values that might be emitted in the final logs is known ahead of time, so those values should be extractable into a token database at compile time.
Another major challenge overall is maintaining a user interface
that is easy to understand and use. The current primary interface through
pw_log
provides printf-style formatting, which is familiar and succinct
for basic applications.
We also have to contend with the interchangeable backends of pw_log
. The
pw_log
facade is intended as an opaque interface layer; adding syntax
specifically for tokenized logging will break this abstraction barrier. Either
this additional syntax would be ignored by other backends, or it might simply
be incompatible (e.g. logging raw integer tokens instead of strings).
Pigweed already supports one form of nested tokens via Base64 encoding. Base64
tokens begin with '$'
, followed by Base64-encoded data, and may be padded
with one or two trailing '='
symbols. The Python
Detokenizer.detokenize_base64
method recursively detokenizes Base64 by
running a regex replacement on the formatted results of each iteration. Base64
is not merely a token format, however; it can encode any binary data in a text
format at the cost of reduced efficiency. Therefore, Base64 tokens may include
not only a database token that may detokenize to a format string but also
binary-encoded arguments. Other token types are not expected to include this
additional argument data.
Detailed design#
Tokenization#
pw_tokenizer
and pw_log_tokenized
already provide much of the necessary
functionality to support tokenized arguments. The proposed API is fully
backward-compatible with non-nested tokenized logging.
Token arguments are indicated in log format strings via PRI-style macros that
are exposed by a new pw_log/tokenized_args.h
header. PW_LOG_TOKEN_FMT
supplies the $
token prefix, brackets around the domain, the base specifier,
and the printf-style specifier including padding and width, i.e. %011o
for
base-8, %010u
for base-10, and %08X
for base-16.
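For illustration only, expansions consistent with the examples in this document might look like the following; the exact strings are an implementation detail rather than part of this proposal:
// PW_LOG_TOKEN_FMT()              -> "$#%08x"               (empty domain, default base 16)
// PW_LOG_TOKEN_FMT(10)            -> "$10#%010u"
// PW_LOG_ENUM_FMT(foo::Color, 10) -> "${foo::Color}10#%010d"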
For free-standing string arguments such as those where the literals are defined
in the log statements themselves, tokenization is performed with macros from
pw_log/tokenized_args.h
. With the tokenized logging backend, these macros
simply alias the corresponding PW_TOKENIZE
macros, but they also revert to
basic string formatting for other backends. This is achieved by placing an
empty header file in the local public_overrides
directory of
pw_log_tokenized
and checking for it in pw_log/tokenized_args.h
using
the __has_include
directive.
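A minimal sketch of how this could be structured, assuming a hypothetical name for the override header and using the existing PW_TOKENIZE_STRING_EXPR macro as the tokenizing implementation:
// pw_log/tokenized_args.h (sketch; the override header name is hypothetical)
#if __has_include("pw_log_tokenized/tokenized_args_override.h")
#include "pw_tokenizer/tokenize.h"
// Tokenized backend: string arguments are stored as 32-bit tokens.
#define PW_LOG_TOKEN_EXPR(string_literal) PW_TOKENIZE_STRING_EXPR(string_literal)
#define PW_LOG_TOKEN_TYPE pw_tokenizer_Token
#else
// Other backends: string arguments remain plain strings formatted with %s.
#define PW_LOG_TOKEN_EXPR(string_literal) (string_literal)
#define PW_LOG_TOKEN_TYPE const char*
#endif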
For variable string arguments, the API is split across locations. The string literals are tokenized wherever they are defined, and the string format macros appear in the log format strings corresponding to those string arguments.
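As a sketch (the variable and string names here are illustrative, and the expansion of PW_LOG_TOKEN_FMT follows the assumption above), the two halves might look like this:
// Where the string is defined, it is tokenized rather than stored as a string.
const PW_LOG_TOKEN_TYPE boot_mode = PW_LOG_TOKEN_EXPR("recovery");
// Where the value is logged, the format string carries the matching specifier.
// token database literal: "Booting into $#%08x mode"
PW_LOG("Booting into " PW_LOG_TOKEN_FMT() " mode", boot_mode);
// after detokenization: "Booting into recovery mode"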
When tokens use non-default domains, additional work may be required to create the domain name and store associated tokens in the ELF.
Enum Tokenization#
We use existing pw_tokenizer
utilities to record the raw enum values as
tokens corresponding to their string names in the ELF. There is no change
required for the backend implementation; we simply skip the token calculation
step, since we already have a value to use, and specifying a token domain is
generally required to isolate multiple enums from token collision.
For ease of use, we can also provide a macro that wraps the enum value list and encapsulates the recording of each token value-string pair in the ELF.
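For the Color example above, the recorded entries might conceptually look like the following; the recording mechanism and exact strings are not fixed by this proposal:
// Conceptual ELF token database entries for TOKENIZE_ENUM(foo::Color, ...):
//   domain "foo::Color": 0x00000000 -> "kRed"
//   domain "foo::Color": 0x00000001 -> "kGreen"
//   domain "foo::Color": 0x00000002 -> "kBlue"
// Each token is the enumerator's integral value, so no hashing is performed.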
When actually logging the values, users pass the enum type name as the domain
to the format specifier macro PW_LOG_TOKEN(), and the enum values can be
, and the enum values can be
passed as-is to PW_LOG
(casting to integers as necessary for scoped enums).
Since integers are varint-encoded over the wire, this will only require a
single byte for most enums.
Logging pw::Status
Note that while this immediately reduces transmission size, the code
space occupied by the string names in pw::Status::str()
cannot be
recovered unless an entire project is converted to log pw::Status
as tokens.
#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"
#include "pw_status/status.h"
pw::Status status = pw::Status::NotFound();
// "pw::Status: ${pw::Status}#%08d"
PW_LOG("pw::Status: " PW_LOG_TOKEN(pw::Status), status.code)
// "pw::Status: NOT_FOUND"
Since the token mapping entries in the ELF are optimized out of the final
binary, the enum domains are tokenized away as part of the log format strings,
and we don’t need to store separate tokens for each enum value, this addition
to the API would provide enum value names in logs with zero additional
RAM cost. Compared to logging strings with ToString
-style functions, we
save space on the string names as well as the functions themselves.
Token Database#
Token databases will be expanded to include a column for domains, so that multiple domains can be encompassed in a single database rather than requiring separate databases for each domain. This is important because domains are being used to categorize tokens within a single project, rather than merely keeping separate projects distinct from each other. When creating a database from an ELF, a domain may be specified as the default domain instead of the empty domain. A list of domains or path to a file with a list of domains may also separately be specified to define which domains are to be included in the database; all domains are now included by default.
When accessing a token database, both a domain and token value may be specified to access specific values. If a domain is not specified, the default domain will be assumed, retaining the same behavior as before.
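For illustration, a CSV database with the added domain column might contain entries like the following for the pw::Status example above; the column order shown here (token, removal date, domain, string) is an assumption rather than a defined format:
16170adf,,,"Status: ${pw::Status}#%08x"
00000005,,pw::Status,"PW_STATUS_NOT_FOUND"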
Detokenization#
Detokenization is relatively straightforward. When the detokenizer is called, it will first detokenize and format the top-level token and binary argument data. The detokenizer will then find and replace nested tokens in the resulting formatted string, then rescan the result for more nested tokens up to a fixed number of rescans.
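A minimal sketch of that flow in C++, with hypothetical helper names standing in for the detokenizer internals and an assumed rescan limit:
#include <string>
// Hypothetical helpers standing in for the real detokenizer internals.
std::string DetokenizeAndFormatTopLevel(const std::string& encoded_log);
bool ReplaceNestedTokensOnce(std::string& text);  // true if any token was replaced
std::string DetokenizeNested(const std::string& encoded_log) {
  constexpr int kMaxRescans = 10;  // assumed; the proposal only requires a fixed limit
  std::string text = DetokenizeAndFormatTopLevel(encoded_log);
  // Rescan the formatted result until no nested tokens remain or the limit is hit.
  for (int i = 0; i < kMaxRescans; ++i) {
    if (!ReplaceNestedTokensOnce(text)) {
      break;
    }
  }
  return text;
}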
For each token type or format, pw_tokenizer
defines a regular expression to
match the expected formatted output token and a helper function to convert a
token from a particular format to its mapped value. The regular expressions for
each token type are combined into a single regex that matches any one of the
formats. At each recursive step for every match, each detokenization format
will be attempted, stopping at the first successful token type and then
recursively replacing all nested tokens in the result. Only full data-encoding
token types like Base64 additionally require string/argument formatting as part
of the recursive step.
For non-Base64 tokens, a token’s base encoding as specified by BASE#
determines its set of permissible alphanumeric characters and the
maximum token width for regex matching.
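For example, illustrative patterns (not the actual implementation) for base-10 and base-16 tokens with an optional domain could look like:
// Widths follow the fixed-width rule above: 10 decimal digits or 8 hex digits.
constexpr char kBase10TokenRegex[] = R"(\$(\{[^}]*\})?10#\d{10})";
constexpr char kBase16TokenRegex[] = R"(\$(\{[^}]*\})?(16)?#[0-9A-Fa-f]{8})";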
If nested detokenization fails for any reason, the formatted token will be
printed as-is in the output logs. If show_errors
is true for the
detokenizer, errors will appear in parentheses immediately following the
token. Supported errors include:
(token collision)
(missing database)
(token not found)
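For example, a nested token that is missing from the database might be rendered as:
// Base-10 token that could not be detokenized, with show_errors enabled
$10#0086025943 (token not found)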
Alternatives#
Protobuf-based Tokenization#
Tokenization may be expanded to function on structured data via protobufs. This can be used to make logging more flexible, as all manner of compile-time metadata can be freely attached to log arguments at effectively no cost. This will most likely involve a separate build process to generate and tokenize partially-populated protos and will significantly change the user API. It will also be a large break from the existing process in implementation, as the current system relies only on existing C preprocessor and C++ constexpr tricks to function.
In this model, the token domain would likely be a fully-qualified namespace for or path to the proto definition.
Implementing this approach also requires a method of passing ordered arguments to a partially-filled detokenized protobuf in a manner similar to printf-style string formatting, so that argument data can be efficiently encoded and transmitted alongside the protobuf’s token, and the arguments to a particular proto can be disambiguated from arguments to the rest of a log statement.
This approach will also most likely preclude plain string logging as is
currently supported by pw_log
, as the implementations diverge dramatically.
However, if pursued, this would likely be made the default logging schema
across all platforms, including host devices.
Custom Detokenization#
Theoretically, individual projects could implement their own regex replacement schemes on top of Pigweed’s detokenizer, allowing them to more flexibly define complex relationships between logged tokens via custom log format string syntax. However, Pigweed should provide utilities for nested tokenization in common cases such as logging enums.
The changes proposed do not preclude additional custom detokenization schemas if absolutely necessary, and such practices do not appear to have been popular thus far in any case.
Open questions#
Missing API definitions:
Updated APIs for creating and accessing token databases with multiple domains
Python nested tokenization
C++ nested detokenization