0105: Nested Tokens and Tokenized Log Arguments#

Status: Accepted

Proposal Date: 2023-07-10

CL: pwrev/154190

Author: Gwyneth Chen

Facilitator: Wyatt Hepler

Summary#

This SEED describes a number of extensions to the pw_tokenizer and pw_log_tokenized modules to improve support for nesting tokens and to add facilities for tokenizing log arguments such as strings and enums. This SEED primarily addresses C/C++ tokenization and Python/C++ detokenization.

Motivation#

Currently, pw_tokenizer and pw_log_tokenized enable devices with limited memory to store long log format strings as hashed 32-bit tokens. When logs are moved off-device, host tooling can recover the full logs using token databases that were created when building the device image. However, logs may still have runtime string arguments that are stored and transferred 1:1 without additional encoding. This SEED aims to extend tokenization to these arguments to further reduce the weight of logging for embedded applications.

The proposed changes affect both the tokenization module itself and the logging facilities built on top of tokenization.

Proposal#

Logging enums such as pw::Status is one common special case where tokenization is particularly appropriate: enum values are conceptually already tokens mapping to their names, assuming no duplicate values. Logging enums frequently entails creating functions and string names that occupy space exclusively for logging purposes, which this proposal seeks to mitigate. Here, pw::Status::NotFound() is presented as an illustrative example of the several transformations that strings undergo during tokenization and detokenization, further complicated in the proposed design by nested tokens.

Enum Tokenization/Detokenization Phases#
  1. Source code

PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())

  2. Token database entries (token, string, domain)

16170adf, "Status: ${pw::Status}#%08x", ""
5       , "PW_STATUS_NOT_FOUND"       , "pw::Status"

  3. Wire format

df 0a 17 16 0a (5 bytes)

  4. Top-level detokenized and formatted

"Status: ${pw::Status}#00000005"

  5. Fully detokenized

"Status: PW_STATUS_NOT_FOUND"

Compared to log tokenization without nesting, string literals in token database entries may not be identical to what is typed in source code due to the use of macros and preprocessor string concatenation. The detokenizer also takes an additional step to recursively detokenize any nested tokens. In exchange for this added complexity, nested enum tokenization allows us to gain the readability of logging value names with zero additional runtime space or performance cost compared to logging the integral values directly with pw_log_tokenized.

Note

Without nested enum token support, users can select either readability or reduced binary and transmission size, but not easily both:

Raw integers vs. string names

  1. Source code

     Raw integers: PW_LOG("Status: %x", status.code())
     String names: PW_LOG("Status: %s", pw_StatusString(status))

  2. Token database entries (token, string, domain)

     Raw integers: 03a83461, "Status: %x", ""
     String names: 069c3ef0, "Status: %s", ""

  3. Wire format

     Raw integers: 61 34 a8 03 0a (5 bytes)
     String names: f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44 (14 bytes)

  4. Top-level detokenized and formatted

     Raw integers: "Status: 5"
     String names: "Status: PW_STATUS_NOT_FOUND"

  5. Fully detokenized

     Raw integers: "Status: 5"
     String names: "Status: PW_STATUS_NOT_FOUND"

Tokenization (C/C++)#

The pw_log_tokenized module exposes a set of macros for creating and formatting nested tokens. Within format strings in the source code, tokens are specified using function-like PRI-style macros. These can be used to encode static information like the token domain or a numeric base encoding and are macro-expanded to string literals that are concatenated with the rest of the format string during preprocessing. Since pw_log generally uses printf syntax, only bases 8, 10, and 16 are supported for integer token arguments via %[odiuxX].

The provided macros enforce the token specifier syntax and keep the argument types in sync when switching between pw_log backends, e.g. to pw_log_basic. The macros for basic usage are as follows:

  • PW_LOG_TOKEN and PW_LOG_TOKEN_EXPR are used to tokenize string args.

  • PW_LOG_TOKEN_FMT is used inside the format string to specify a token arg.

  • PW_LOG_TOKEN_TYPE is used if the type of a tokenized arg needs to be referenced, e.g. as a ToString function return type.

#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"

// token with default options base-16 and empty domain
// token database literal: "The sun will come out $#%08x!"
PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!", PW_LOG_TOKEN_EXPR("tomorrow"));
// after detokenization: "The sun will come out tomorrow!"

Additional macros are also provided specifically for enum handling. The TOKENIZE_ENUM macro creates ELF token database entries for each enum value with the specified token domain to prevent token collision between multiple tokenized enums. This macro is kept separate from the enum definition to allow things like tokenizing a preexisting enum defined in an external dependency.

// enums
namespace foo {

  enum class Color { kRed, kGreen, kBlue };

  // syntax TBD
  TOKENIZE_ENUM(
    foo::Color,
    kRed,
    kGreen,
    kBlue
  )

} // namespace foo

void LogColor(foo::Color color) {
  // token database literal:
  // "Color: [${foo::Color}10#%010d]"
  PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color);
  // after detokenization:
  // e.g. "Color: kRed"
}

Nested Base64 tokens

PW_LOG_TOKEN_FMT can accept 64 as the base encoding for an argument, in which case the argument should be a pre-encoded Base64 string (e.g. QAzF39==). However, this should be avoided when possible to maximize space savings. Fully formatted Base64, including the token prefix, may also be logged with %s as before.
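
As a rough sketch of how a pre-encoded Base64 token could be passed as an argument, assuming PW_LOG_TOKEN_FMT(64) pairs with a string-style argument (the function and variable names here are purely illustrative):

#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"

// Illustrative only: base64_token holds pre-encoded Base64 data such as
// "QAzF39==" (without the "$" prefix, which PW_LOG_TOKEN_FMT supplies).
void LogDetachedReport(const char* base64_token) {
  PW_LOG("Detached report: " PW_LOG_TOKEN_FMT(64), base64_token);
}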

Detokenization (Python)#

Detokenizer.detokenize in Python (Detokenizer::Detokenize in C++) will automatically recursively detokenize tokens of all known formats rather than requiring a separate call to detokenize_base64 or similar.
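
For reference, a minimal C++ sketch of this flow, assuming a Detokenizer constructed elsewhere from a database covering all domains and the span-based Detokenize overload (the wrapper function itself is illustrative):

#include <cstddef>
#include <string>

#include "pw_span/span.h"
#include "pw_tokenizer/detokenize.h"

// Illustrative only: under this proposal, Detokenize() also expands any
// nested tokens (domain-qualified enums, Base64, etc.) in the formatted
// result, with no separate Base64-specific call.
std::string DetokenizeLog(const pw::tokenizer::Detokenizer& detokenizer,
                          pw::span<const std::byte> encoded_message) {
  return detokenizer.Detokenize(encoded_message).BestString();
}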

To support detokenizing domain-specific tokens, token databases support multiple domains, and database.py create will build a database with tokens from all domains by default. Specifying a domain during database creation will cause that domain to be treated as the default.

When detokenization fails, tokens appear as-is in logs. If the detokenizer has the show_errors option set to True, error messages may be printed inline following the raw token.

Tokens#

Many of the details described here are handled by the PW_LOG_TOKEN_FMT macro, so users should typically not need to format tokens manually. However, if detokenization fails for any reason, tokens appear in the final logs with the following format and should be easily recognizable.

Nested tokens have the following structure in partially detokenized logs (transformation stage 4):

$[{DOMAIN}][BASE#]TOKEN

The $ is a common prefix required for all nested tokens. It is possible to configure a different common prefix if necessary, but using the default $ character is strongly recommended.

Options#

{DOMAIN}

Specifies the token domain. If this option is omitted, the default (empty) domain is assumed.

BASE#

Defines the numeric base encoding of the token. Accepted values are 8, 10, 16, and 64. If the hash symbol # is used without specifying a number, the base is assumed to be 16. If the base option is omitted entirely, the base defaults to 64 for backward compatibility. All encodings except Base64 are case-insensitive.

This option may be expanded to support other bases in the future.

TOKEN (required)

The numeric representation of the token in the given base encoding. All encodings except Base64 are left-padded with zeroes to the maximum width of a 32-bit integer in the given base. Base64 data may additionally encode string arguments for the detokenized token, and therefore does not have a maximum width. This is automatically handled by PW_LOG_TOKEN_FMT for supported bases.

When used in conjunction with pw_log_tokenized, the token prefix (including any domain and base specifications) is tokenized as part of the log format string and therefore incurs zero additional memory or transmission cost over that of the original format string. Over the wire, tokens in bases 8, 10, and 16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64 tokens continue to be encoded as strings.

Warning

Tokens do not have a terminating character in general, which is why we require them to be formatted with a fixed width. Because each token is padded to the maximum width of its base (e.g. exactly eight hex digits for base 16), alphanumeric characters that immediately follow a token are not mistaken for part of it; without the fixed width, they would cause detokenization errors.

Recognizing raw nested tokens in strings

When a string is fully detokenized, there should no longer be any indication of tokenization in the final result, e.g. detokenized logs should read the same as plain string logs. However, if nested tokens cannot be detokenized for any reason, they will appear in their raw form as below:

// Base64 token with no arguments and empty domain
$QA19pfEQ

// Base-10 token
$10#0086025943

// Base-16 token with specified domain
${foo_namespace::MyEnum}#0000001A

// Base64 token with specified domain
${bar_namespace::MyEnum}QAQQQQ==

Problem investigation#

Complex embedded device projects are perpetually seeking more RAM. Even a handful of longer descriptive string arguments can take up hundreds of bytes that often exist exclusively for logging purposes, with no impact on device function.

One of the most common potential use cases is logging enum values. Inspection of one project revealed that enums accounted for some 90% of the string log arguments. We have encountered instances where, to save space, developers have avoided logging descriptive names in favor of raw enum values, forcing readers of the logs to look up or memorize the meaning of each number. As with log format strings, the set of possible string values that might be emitted in the final logs is known, so these names should be extractable into a token database at compile time.

Another major challenge overall is maintaining a user interface that is easy to understand and use. The current primary interface through pw_log provides printf-style formatting, which is familiar and succinct for basic applications.

We also have to contend with the interchangeable backends of pw_log. The pw_log facade is intended as an opaque interface layer; adding syntax specifically for tokenized logging would break this abstraction barrier. Either the additional syntax would be ignored by other backends, or it might simply be incompatible (e.g. logging raw integer tokens instead of strings).

Pigweed already supports one form of nested tokens via Base64 encoding. Base64 tokens begin with '$', followed by Base64-encoded data, and may be padded with one or two trailing '=' symbols. The Python Detokenizer.detokenize_base64 method recursively detokenizes Base64 by running a regex replacement on the formatted results of each iteration. Base64 is not merely a token format, however; it can encode any binary data in a text format at the cost of reduced efficiency. Therefore, Base64 tokens may include not only a database token that may detokenize to a format string but also binary-encoded arguments. Other token types are not expected to include this additional argument data.

Detailed design#

Tokenization#

pw_tokenizer and pw_log_tokenized already provide much of the necessary functionality to support tokenized arguments. The proposed API is fully backward-compatible with non-nested tokenized logging.

Token arguments are indicated in log format strings via PRI-style macros that are exposed by a new pw_log/tokenized_args.h header. PW_LOG_TOKEN_FMT supplies the $ token prefix, brackets around the domain, the base specifier, and the printf-style specifier including padding and width, i.e. %011o for base-8, %010u for base-10, and %08X for base-16.
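
As a rough illustration of the pieces PW_LOG_TOKEN_FMT contributes, a hypothetical stand-in macro for the base-16 case might look like the following (the name and exact literal are assumptions, not the proposed implementation):

// Hypothetical stand-in, hard-coded for base 16 with the default domain.
// The real macro also accepts a domain and a base argument.
#define EXAMPLE_TOKEN_FMT_HEX() "$" "#" "%08x"

// "Boot stage: " EXAMPLE_TOKEN_FMT_HEX() concatenates to "Boot stage: $#%08x",
// which is the literal that gets hashed into the token database.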

For free-standing string arguments such as those where the literals are defined in the log statements themselves, tokenization is performed with macros from pw_log/tokenized_args.h. With the tokenized logging backend, these macros simply alias the corresponding PW_TOKENIZE macros, but they also revert to basic string formatting for other backends. This is achieved by placing an empty header file in the local public_overrides directory of pw_log_tokenized and checking for it in pw_log/tokenized_args.h using the __has_include directive.
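
A simplified sketch of that selection logic is shown below; the marker header path and the fallback definitions are assumptions for illustration, not the actual contents of pw_log/tokenized_args.h:

// Illustrative sketch of pw_log/tokenized_args.h backend selection.
#if __has_include("pw_log_tokenized/tokenized_args_marker.h")  // hypothetical marker
// Tokenizing backend: alias the pw_tokenizer macros.
#define PW_LOG_TOKEN(string_literal) PW_TOKENIZE_STRING(string_literal)
#define PW_LOG_TOKEN_EXPR(string_literal) PW_TOKENIZE_STRING_EXPR(string_literal)
#define PW_LOG_TOKEN_TYPE pw_tokenizer_Token
#define PW_LOG_TOKEN_FMT() "$#%08x"  // hypothetical literal
#else
// Other backends: fall back to plain string formatting so the same log
// statements still compile and read as text.
#define PW_LOG_TOKEN(string_literal) (string_literal)
#define PW_LOG_TOKEN_EXPR(string_literal) (string_literal)
#define PW_LOG_TOKEN_TYPE const char*
#define PW_LOG_TOKEN_FMT() "%s"
#endif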

For variable string arguments, the API is split across locations. The string literals are tokenized wherever they are defined, and the string format macros appear in the log format strings corresponding to those string arguments.
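
For example, a tokenized ToString-style helper might look like the following sketch (the names and states are illustrative assumptions):

#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"

// The string literals are tokenized where they are defined...
PW_LOG_TOKEN_TYPE MotorStateName(bool running) {
  return running ? PW_LOG_TOKEN_EXPR("RUNNING") : PW_LOG_TOKEN_EXPR("STOPPED");
}

void LogMotorState(bool running) {
  // ...while the matching format macro appears in the log format string.
  PW_LOG("Motor state: " PW_LOG_TOKEN_FMT(), MotorStateName(running));
}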

When tokens use non-default domains, additional work may be required to create the domain name and store associated tokens in the ELF.

Enum Tokenization#

We use existing pw_tokenizer utilities to record the raw enum values as tokens corresponding to their string names in the ELF. No change is required for the backend implementation; we simply skip the token calculation step, since we already have a value to use. Specifying a token domain is generally required to isolate multiple enums from token collisions.

For ease of use, we can also provide a macro that wraps the enum value list and encapsulates the recording of each token value-string pair in the ELF.

When actually logging the values, users pass the enum type name as the domain to the format specifier macro PW_LOG_ENUM_FMT(), and the enum values can be passed as-is to PW_LOG (cast to integers as necessary for scoped enums). Since integers are varint-encoded over the wire, this will only require a single byte for most enums.

Logging pw::Status

Note that while this immediately reduces transmission size, the code space occupied by the string names in pw::Status::str() cannot be recovered unless an entire project is converted to log pw::Status as tokens.

#include "pw_log/log.h"
#include "pw_log/tokenized_args.h"
#include "pw_status/status.h"

pw::Status status = pw::Status::NotFound();

// "pw::Status: ${pw::Status}#%08d"
PW_LOG("pw::Status: " PW_LOG_ENUM_FMT(pw::Status), status.code());
// "pw::Status: NOT_FOUND"

Since the token mapping entries in the ELF are optimized out of the final binary, the enum domains are tokenized away as part of the log format strings, and we don't need to store separate tokens for each enum value, this addition to the API would provide enum value names in logs at zero additional RAM cost. Compared to logging strings with ToString-style functions, we save space on the string names as well as the functions themselves.

Token Database#

Token databases will be expanded to include a column for domains, so that multiple domains can be encompassed in a single database rather than requiring separate databases for each domain. This is important because domains are being used to categorize tokens within a single project, rather than merely keeping separate projects distinct from each other. When creating a database from an ELF, a domain may be specified as the default domain instead of the empty domain. A list of domains or path to a file with a list of domains may also separately be specified to define which domains are to be included in the database; all domains are now included by default.

When accessing a token database, both a domain and token value may be specified to access specific values. If a domain is not specified, the default domain will be assumed, retaining the same behavior as before.

Detokenization#

Detokenization is relatively straightforward. When the detokenizer is called, it will first detokenize and format the top-level token and binary argument data. The detokenizer will then find and replace nested tokens in the resulting formatted string, then rescan the result for more nested tokens up to a fixed number of rescans.

For each token type or format, pw_tokenizer defines a regular expression to match the expected formatted output token and a helper function to convert a token from a particular format to its mapped value. The regular expressions for each token type are combined into a single regex that matches any one of the formats. For every match at each recursive step, each detokenization format is attempted, stopping at the first token type that succeeds and then recursively replacing all nested tokens in the result. Only full data-encoding token types like Base64 also require string/argument formatting as part of the recursive step.

For non-Base64 tokens, a token’s base encoding as specified by BASE# determines its set of permissible alphanumeric characters and the maximum token width for regex matching.
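
A purely illustrative sketch of the kind of combined pattern involved is shown below; it is deliberately loose and is not the expression the tooling would actually use:

#include <regex>
#include <string>

// Illustrative only: a simplified combined pattern for $[{DOMAIN}][BASE#]TOKEN.
// The real matcher constrains each base to its own alphabet and fixed width
// (e.g. exactly eight hex digits for base 16) rather than this loose match.
const std::regex kNestedTokenPattern(
    R"(\$(\{[^}]*\})?((?:8|10|16|64)?#)?([A-Za-z0-9+/\-_=]+))");

bool ContainsRawNestedToken(const std::string& text) {
  return std::regex_search(text, kNestedTokenPattern);
}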

If nested detokenization fails for any reason, the formatted token will be printed as-is in the output logs. If show_errors is true for the detokenizer, errors will appear in parentheses immediately following the token. Supported errors include:

  • (token collision)

  • (missing database)

  • (token not found)

Alternatives#

Protobuf-based Tokenization#

Tokenization may be expanded to operate on structured data via protobufs. This would make logging more flexible, as all manner of compile-time metadata could be freely attached to log arguments at effectively no cost. It would most likely involve a separate build process to generate and tokenize partially populated protos and would significantly change the user API. It would also be a significant departure from the existing implementation, which relies only on C preprocessor and C++ constexpr tricks.

In this model, the token domain would likely be the fully-qualified namespace of, or a path to, the proto definition.

Implementing this approach also requires a method of passing ordered arguments to a partially-filled detokenized protobuf in a manner similar to printf-style string formatting, so that argument data can be efficiently encoded and transmitted alongside the protobuf’s token, and the arguments to a particular proto can be disambiguated from arguments to the rest of a log statement.

This approach will also most likely preclude plain string logging as is currently supported by pw_log, as the implementations diverge dramatically. However, if pursued, this would likely be made the default logging schema across all platforms, including host devices.

Custom Detokenization#

Theoretically, individual projects could implement their own regex replacement schemes on top of Pigweed’s detokenizer, allowing them to more flexibly define complex relationships between logged tokens via custom log format string syntax. However, Pigweed should provide utilities for nested tokenization in common cases such as logging enums.

The changes proposed do not preclude additional custom detokenization schemas if absolutely necessary, and such practices do not appear to have been popular thus far in any case.

Open questions#

Missing API definitions:

  • Updated APIs for creating and accessing token databases with multiple domains

  • Python nested tokenization

  • C++ nested detokenization