pw_tokenizer API reference#

pw_tokenizer: Compress strings to shrink logs by +75%

Configuration#

Defines

PW_TOKENIZER_CFG_ARG_TYPES_SIZE_BYTES#

For a tokenized string with arguments, the types of the arguments are encoded in either 4 bytes (uint32_t) or 8 bytes (uint64_t). 4 bytes supports up to 14 tokenized string arguments; 8 bytes supports up to 29 arguments. Using 8 bytes increases code size for 32-bit machines.

Argument types are encoded two bits per argument, in little-endian order. The 4 or 6 least-significant bits, respectively, store the number of arguments, while the remaining bits encode the argument types.
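The packing described above can be sketched in Python. This is an illustration only: pack_arg_types is a hypothetical helper, the two-bit type codes passed to it are placeholders, and it assumes that "little-endian order" means the first argument occupies the lowest pair of type bits.

```python
def pack_arg_types(type_codes, size_bytes=4):
    """Packs two-bit type codes into one integer (hypothetical helper)."""
    if size_bytes not in (4, 8):
        raise ValueError("size_bytes must be 4 or 8")
    # The argument count lives in the 4 (uint32_t) or 6 (uint64_t) lowest bits.
    count_bits = 4 if size_bytes == 4 else 6
    # Every remaining pair of bits holds one argument type: 14 or 29 in total.
    max_args = (size_bytes * 8 - count_bits) // 2
    if len(type_codes) > max_args:
        raise ValueError(f"at most {max_args} arguments fit in {size_bytes} bytes")
    packed = len(type_codes)
    for index, code in enumerate(type_codes):
        if not 0 <= code <= 3:
            raise ValueError("type codes must fit in two bits")
        # Assumption: argument 0 sits directly above the count bits.
        packed |= code << (count_bits + 2 * index)
    return packed


print(hex(pack_arg_types([0, 3])))  # count 2, then type codes 0 and 3
```

The limit computed as max_args matches the 14 and 29 argument limits quoted above.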

PW_TOKENIZER_CFG_C_HASH_LENGTH#

Maximum number of characters to hash in C. In C code, strings shorter than this length are treated as if they were zero-padded up to the length. Strings that are the same length and share a common prefix longer than this value hash to the same value. Increasing PW_TOKENIZER_CFG_C_HASH_LENGTH increases the compilation time for C due to the complexity of the hashing macros.

PW_TOKENIZER_CFG_C_HASH_LENGTH has no effect on C++ code. In C++, hashing is done with a constexpr function instead of a macro. There are no string length limitations and compilation times are unaffected by this macro.

Only hash lengths for which there is a corresponding macro header (pw_tokenizer/internal/pw_tokenizer_65599_fixed_length_#_hash_macro.) are supported. Additional macros may be generated with the generate_hash_macro.py function. New macro headers must then be added to pw_tokenizer/internal/tokenize_string.h.

This MUST match the value of DEFAULT_C_HASH_LENGTH in pw_tokenizer/py/pw_tokenizer/tokens.py.

PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES#

PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES is deprecated. It is used as the default value for pw_log_tokenized’s PW_LOG_TOKENIZED_ENCODING_BUFFER_SIZE_BYTES. This value should not be configured; set PW_LOG_TOKENIZED_ENCODING_BUFFER_SIZE_BYTES instead.

PW_TOKENIZER_NESTED_PREFIX_STR#

Tokenization#

size_t pw::tokenizer::EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, span<std::byte> output)#

Encodes a tokenized string’s arguments to a buffer. The pw_tokenizer_ArgTypes parameter specifies the argument types, in place of a format string.

Most tokenization implementations should use the EncodedMessage class.

template<size_t kMaxSizeBytes>
class EncodedMessage#

Encodes a tokenized message to a fixed size buffer. This class is used to encode tokenized messages passed in from tokenization macros.

To use pw::tokenizer::EncodedMessage, construct it with the token, argument types, and va_list from the variadic arguments:

void SendLogMessage(span<std::byte> log_data);

extern "C" void TokenizeToSendLogMessage(pw_tokenizer_Token token,
                                         pw_tokenizer_ArgTypes types,
                                         ...) {
  va_list args;
  va_start(args, types);
  EncodedMessage encoded_message(token, types, args);
  va_end(args);

  SendLogMessage(encoded_message);  // EncodedMessage converts to span
}

Public Functions

inline const std::byte *data() const#

The binary-encoded tokenized message.

inline const uint8_t *data_as_uint8() const#

Returns data() as a pointer to uint8_t instead of std::byte.

inline size_t size() const#

The size of the encoded tokenized message in bytes.

template<typename ...ArgTypes>
constexpr size_t pw::tokenizer::MinEncodingBufferSizeBytes()#

Calculates the minimum buffer size to allocate that is guaranteed to support encoding the specified arguments.

The contents of strings are NOT included in this total. The string’s length/status byte is guaranteed to fit, but the string contents may be truncated. Encoding is considered to succeed as long as the string’s length/status byte is written, even if the actual string is truncated.

Examples:

  • Message with no arguments: MinEncodingBufferSizeBytes() == 4

  • Message with an int argument: MinEncodingBufferSizeBytes<int>() == 9 (4 + 5)
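The arithmetic behind these examples can be sketched as follows. The 4-byte token, the 5-byte worst case for an int, and the single length/status byte for a string come from the text above; the sizes used for int64 and float are assumptions, and min_encoding_buffer_size_bytes is a hypothetical Python stand-in for the C++ template.

```python
TOKEN_SIZE_BYTES = 4

# Worst-case encoded size per argument kind.
MAX_ARG_SIZE_BYTES = {
    "int": 5,      # 32-bit value as a varint: ceil(32 / 7) bytes
    "int64": 10,   # assumption: 64-bit value as a varint: ceil(64 / 7) bytes
    "float": 4,    # assumption: encoded as a 4-byte float
    "string": 1,   # only the length/status byte is guaranteed to fit
}


def min_encoding_buffer_size_bytes(*arg_kinds):
    """Returns the token size plus the worst-case size of each argument."""
    return TOKEN_SIZE_BYTES + sum(MAX_ARG_SIZE_BYTES[kind] for kind in arg_kinds)


print(min_encoding_buffer_size_bytes())
print(min_encoding_buffer_size_bytes("int"))
```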

PW_TOKEN_FMT()#

Format specifier for a token argument.

PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)#

Tokenizes a format string with optional arguments and sets the _pw_tokenizer_token variable to the token. Must be used in its own scope, since the same variable is used in every invocation.

The tokenized string uses the specified tokenization domain. Use PW_TOKENIZER_DEFAULT_DOMAIN for the default. The token also may be masked; use UINT32_MAX to keep all bits.

This macro checks that the printf-style format string matches the arguments and that no more than PW_TOKENIZER_MAX_SUPPORTED_ARGS are provided. It then stores the format string in a special section, and calculates the string’s token at compile time.

PW_TOKENIZE_FORMAT_STRING_ANY_ARG_COUNT(domain, mask, format, ...)#

Equivalent to PW_TOKENIZE_FORMAT_STRING, but supports any number of arguments.

This is a low-level macro that should rarely be used directly. It is intended for situations when pw_tokenizer_ArgTypes is not used. There are two situations where pw_tokenizer_ArgTypes is unnecessary:

  • The exact format string argument types and count are fixed.

  • The format string supports a variable number of arguments of only one type. In this case, PW_FUNCTION_ARG_COUNT may be used to pass the argument count to the function.

PW_TOKENIZE_STRING(string_literal)#

Converts a string literal to a pw_tokenizer_Token (uint32_t) token in a standalone statement. C and C++ compatible. In C++, the string may be a literal or a constexpr char array, including function variables like __func__. In C, the argument must be a string literal. In either case, the string must be null-terminated, but may contain any characters (including ‘\0’).

constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");

PW_TOKENIZE_STRING_DOMAIN(domain, string_literal)#

Tokenizes a string literal in a standalone statement using the specified domain. C and C++ compatible.

PW_TOKENIZE_STRING_DOMAIN_EXPR(domain, string_literal)#

Tokenizes a string literal using the specified domain within an expression. Requires C++.

PW_TOKENIZE_STRING_EXPR(string_literal)#

Converts a string literal to a uint32_t token within an expression. Requires C++.

DoSomething(PW_TOKENIZE_STRING_EXPR("Succeed"));

PW_TOKENIZE_STRING_MASK(domain, mask, string_literal)#

Tokenizes a string literal in a standalone statement using the specified domain and bit mask. C and C++ compatible.

PW_TOKENIZE_STRING_MASK_EXPR(domain, mask, string_literal)#

Tokenizes a string literal using the specified domain and bit mask within an expression. Requires C++.

PW_TOKENIZE_TO_BUFFER(buffer, buffer_size_pointer, format, ...)#

Encodes a tokenized string and arguments to the provided buffer. The size of the buffer is passed via a pointer to a size_t. After encoding is complete, the size_t is set to the number of bytes written to the buffer.

The macro’s arguments are equivalent to the following function signature:

TokenizeToBuffer(void* buffer,
                 size_t* buffer_size_pointer,
                 const char* format,
                 ...);  // printf-style arguments

For example, the following encodes a tokenized string with a temperature to a buffer. The buffer is passed to a function to send the message over a UART.

uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER(
    buffer, &size_bytes, "Temperature (C): %0.2f", temperature_c);
MyProject_EnqueueMessageForUart(buffer, size_bytes);

While PW_TOKENIZE_TO_BUFFER is very flexible, it must be passed a buffer, which increases its code size footprint at the call site.

PW_TOKENIZE_TO_BUFFER_DOMAIN(domain, buffer, buffer_size_pointer, format, ...)#

Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.

PW_TOKENIZE_TO_BUFFER_MASK(domain, mask, buffer, buffer_size_pointer, format, ...)#

Same as PW_TOKENIZE_TO_BUFFER_DOMAIN, but applies a bit mask to the token.

PW_TOKENIZER_REPLACE_FORMAT_STRING(...)#

Low-level macro for calling functions that handle tokenized strings.

Functions that work with tokenized format strings must take the following arguments:

  • The 32-bit token (pw_tokenizer_Token)

  • The 32- or 64-bit argument types (pw_tokenizer_ArgTypes)

  • Variadic arguments, if any

This macro expands to those arguments. Custom tokenization macros should use this macro to pass these arguments to a function or other macro.

void EncodeMyTokenizedString(uint32_t token,
                             pw_tokenizer_ArgTypes arg_types,
                             ...);

#define CUSTOM_TOKENIZATION_MACRO(format, ...)                  \
  PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
  EncodeMyTokenizedString(PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__))

PW_TOKENIZER_ARG_TYPES(...)#

Converts a series of arguments to a compact format that replaces the format string literal. Evaluates to a pw_tokenizer_ArgTypes value.

Depending on the size of pw_tokenizer_ArgTypes, the bottom 4 or 6 bits store the number of arguments and the remaining bits store the types, two bits per type. The arguments are not evaluated; only their types are used.

In general, PW_TOKENIZER_ARG_TYPES should not be used directly. Instead, use PW_TOKENIZER_REPLACE_FORMAT_STRING.

size_t pw_tokenizer_EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, void *output_buffer, size_t output_buffer_size)#

C function that encodes arguments to a tokenized buffer. Use the pw::tokenizer::EncodeArgs() function from C++.

static inline size_t pw_tokenizer_EncodeInt(int value, void *output, size_t output_size_bytes)#

Encodes an int with the standard integer encoding: zig-zag + LEB128. This function is only necessary when manually encoding tokenized messages.

static inline size_t pw_tokenizer_EncodeInt64(int64_t value, void *output, size_t output_size_bytes)#

Encodes an int64_t with the standard integer encoding: zig-zag + LEB128. This function is only necessary when manually encoding tokenized messages.
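For readers unfamiliar with the zig-zag + LEB128 combination named above, the following Python sketch shows the general technique. The helper names are hypothetical and the code illustrates the standard encodings, not the pw_tokenizer implementation itself.

```python
def zigzag(value, bits=64):
    """Maps signed integers to unsigned so small magnitudes stay small."""
    return ((value << 1) ^ (value >> (bits - 1))) & ((1 << bits) - 1)


def encode_leb128(value):
    """Encodes an unsigned integer, 7 bits per byte, least significant first."""
    encoded = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            encoded.append(byte | 0x80)  # more bytes follow
        else:
            encoded.append(byte)         # final byte has the top bit clear
            return bytes(encoded)


def encode_int(value):
    """Zig-zag encodes a signed integer, then writes it as LEB128."""
    return encode_leb128(zigzag(value))


print(encode_int(-1), encode_int(1), encode_int(150))
```

Zig-zag encoding interleaves negative and positive values (0, -1, 1, -2, ...), so small negative numbers take one byte instead of the maximum width.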

typedef uint32_t pw_tokenizer_Token#

The type of the 32-bit token used in place of a string. Also available as pw::tokenizer::Token.

pw_tokenizer.encode.encode_token_and_args(token: int, *args: int | float | bytes | str) -> bytes#

Encodes a tokenized message given its token and arguments.

This function assumes that the token represents a format string with conversion specifiers that correspond with the provided argument types. Currently, only 32-bit integers are supported.

pw_tokenizer.tokens.pw_tokenizer_65599_hash(string: str | bytes, *, hash_length: int | None = None) -> int#

Hashes the string with the hash function used to generate tokens in C++.

This hash function is used to calculate tokens from strings in Python. It is not used when extracting tokens from an ELF, since the token is stored in the ELF as part of tokenization.
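A minimal sketch of a 65599-style hash is shown below. It assumes the hash is seeded with the string length and adds each byte multiplied by successive powers of 65599, modulo 2^32. That assumption is consistent with the behavior described for PW_TOKENIZER_CFG_C_HASH_LENGTH, but hash_65599 is a hypothetical stand-in; consult pw_tokenizer/py/pw_tokenizer/tokens.py for the authoritative version.

```python
HASH_CONSTANT = 65599


def hash_65599(string, hash_length=None):
    """Hashes a str or bytes value; hash_length caps the characters hashed."""
    data = string.encode() if isinstance(string, str) else string
    hash_value = len(data)  # assumption: the full length seeds the hash
    coefficient = HASH_CONSTANT
    for byte in data[:hash_length]:
        hash_value = (hash_value + coefficient * byte) % 2**32
        coefficient = (coefficient * HASH_CONSTANT) % 2**32
    return hash_value


# Same length and a shared prefix longer than hash_length: identical hashes.
print(hash_65599("prefix-one", 6) == hash_65599("prefix-two", 6))
# A limit longer than the string changes nothing (implicit zero padding).
print(hash_65599("abc", 128) == hash_65599("abc"))
```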

Token databases#

class TokenDatabase#

Reads entries from a v0 binary token string database. This class does not copy or modify the contents of the database.

The v0 token database has two significant shortcomings:

  • Strings cannot contain null terminators (\0). If a string contains a \0, the database will not work correctly.

  • The domain is not included in entries. All tokens belong to a single domain, which must be known independently.

A v0 binary token database consists of a 16-byte header followed by an array of 8-byte entries and a table of null-terminated strings. The header specifies the number of entries. Each entry contains information about a tokenized string: the token and removal date, if any. All fields are little-endian.

The token removal date is stored within an unsigned 32-bit integer. It is stored as <day> <month> <year>, where <day> and <month> are 1 byte each and <year> is two bytes. The fields are set to their maximum value (0xFF or 0xFFFF) if they are unset. With this format, dates may be compared naturally as unsigned integers.

Header (16 bytes)

Offset  Size  Field
0       6     Magic number (TOKENS)
6       2     Version (00 00)
8       4     Entry count
12      4     Reserved

Entry (8 bytes)

Offset  Size  Field
0       4     Token
4       1     Removal day (1-31, 255 if unset)
5       1     Removal month (1-12, 255 if unset)
6       2     Removal year (65535 if unset)

Entries are sorted by token. A string table with a null-terminated string for each entry in order follows the entries.
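The layout above can be read with Python's struct module. The following is a sketch under the stated format only; parse_v0_database is a hypothetical helper, not part of pw_tokenizer, and it does no validation beyond the magic number and version.

```python
import struct

HEADER = struct.Struct("<6sHII")  # magic, version, entry count, reserved
ENTRY = struct.Struct("<IBBH")    # token, removal day, month, year


def parse_v0_database(data):
    """Returns (token, removal_date_or_None, string) tuples in entry order."""
    magic, version, entry_count, _reserved = HEADER.unpack_from(data, 0)
    if magic != b"TOKENS" or version != 0:
        raise ValueError("not a v0 token database")

    # The string table starts right after the last 8-byte entry.
    string_offset = HEADER.size + entry_count * ENTRY.size
    entries = []
    for index in range(entry_count):
        token, day, month, year = ENTRY.unpack_from(
            data, HEADER.size + index * ENTRY.size)
        # Unset date fields hold their maximum value.
        removed = None
        if (day, month, year) != (0xFF, 0xFF, 0xFFFF):
            removed = (year, month, day)
        end = data.index(b"\0", string_offset)  # strings are null-terminated
        entries.append((token, removed, data[string_offset:end].decode()))
        string_offset = end + 1
    return entries


example = (HEADER.pack(b"TOKENS", 0, 2, 0)
           + ENTRY.pack(1, 0xFF, 0xFF, 0xFFFF)
           + ENTRY.pack(2, 15, 6, 2023)
           + b"hello\0world\0")
print(parse_v0_database(example))
```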

Entries are accessed by iterating over the database. An O(n) Find function is also provided. In typical use, a TokenDatabase is preprocessed by a pw::tokenizer::Detokenizer into a std::unordered_map.

Public Functions

inline constexpr TokenDatabase()#

Creates a database with no data. ok() returns false.

Entries Find(uint32_t token) const#

Returns all entries associated with this token. This is O(n).

inline constexpr size_type size() const#

Returns the total number of entries (unique token-string pairs).

inline constexpr bool ok() const#

True if this database was constructed with valid data. The database might be empty, but it has an intact header and a string for each entry.

inline constexpr iterator begin() const#

Returns an iterator for the first token entry.

inline constexpr iterator end() const#

Returns an iterator for one past the last token entry.

Public Static Functions

template<typename ByteArray>
static inline constexpr bool IsValid(const ByteArray &bytes)#

Returns true if the provided data is a valid token database. This checks the magic number (TOKENS), version (which must be 0), and that there is one string for each entry in the database. A database with extra strings or other trailing data is considered valid.

template<const auto &kDatabaseBytes>
static inline constexpr TokenDatabase Create()#

Creates a TokenDatabase and checks if the provided data is valid at compile time. Accepts references to constexpr containers (array, span, string_view, etc.) with static storage duration. For example:

constexpr char kMyData[] = ...;
constexpr TokenDatabase db = TokenDatabase::Create<kMyData>();

template<typename ByteArray>
static inline constexpr TokenDatabase Create(const ByteArray &database_bytes)#

Creates a TokenDatabase from the provided byte array. The array may be a span, array, or other container type. If the data is not valid, returns a default-constructed database for which ok() is false.

Prefer the Create overload that takes the data as a template parameter when possible, since that overload verifies data integrity at compile time.

Public Static Attributes

static constexpr uint32_t kDateRemovedNever = 0xFFFFFFFF#

Default date_removed for an entry in the token database if it was never removed.

class Entries#

A list of token entries returned from a Find operation. This object can be iterated over or indexed as an array.

struct Entry#

An entry in the token database.

Public Members

uint32_t token#

The token that represents this string.

uint32_t date_removed#

The date the token and string were removed from the database, or kDateRemovedNever if they were never removed. Dates are encoded such that natural integer sorting sorts from oldest to newest dates. The date is stored as an 8-bit day, 8-bit month, and 16-bit year, packed into a little-endian uint32_t.

const char *string#

The null-terminated string represented by this token.

class iterator#

Iterator for TokenDatabase values.

Detokenization#

Decodes and detokenizes strings from binary or Base64 input.

The main class provided by this module is the Detokenizer class. To use it, construct it with the path to an ELF or CSV database, a tokens.Database, or a file object for an ELF file or CSV. Then, call the detokenize method with encoded messages, one at a time. The detokenize method returns a DetokenizedString object with the result.

For example:

from pw_tokenizer import detokenize

detok = detokenize.Detokenizer('path/to/firmware/image.elf')
print(detok.detokenize(b'\x12\x34\x56\x78\x03hi!'))
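To make the example bytes above less opaque, the following sketch splits such a message by hand. It assumes the token is a little-endian 32-bit integer and that a string argument is a length/status byte (low 7 bits for the length, top bit as a truncation flag) followed by the string bytes. split_token_and_string is a hypothetical helper, not a pw_tokenizer function.

```python
import struct


def split_token_and_string(message):
    """Splits a message holding a token and one string argument."""
    (token,) = struct.unpack_from("<I", message, 0)  # assumed little-endian
    status = message[4]
    length = status & 0x7F           # assumed: low 7 bits hold the length
    truncated = bool(status & 0x80)  # assumed: top bit marks truncation
    return token, message[5:5 + length], truncated


token, text, truncated = split_token_and_string(b"\x12\x34\x56\x78\x03hi!")
print(hex(token), text, truncated)
```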

This module also provides a command line interface for decoding and detokenizing messages from a file or stdin.

class pw_tokenizer.detokenize.AutoUpdatingDetokenizer(
*paths_or_files: pathlib.Path | str,
min_poll_period_s: float = 1.0,
pool: concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>,
)#

Loads and updates a detokenizer from database paths.

__init__(
*paths_or_files: pathlib.Path | str,
min_poll_period_s: float = 1.0,
pool: concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>,
) -> None#

Decodes and detokenizes binary messages.

Parameters:
  • *token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf

  • show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode

lookup(token: int) -> list[pw_tokenizer.detokenize._TokenizedFormatString]#

Returns (TokenizedStringEntry, FormatString) list for matches.

class pw_tokenizer.detokenize.DetokenizedString(
token: int | None,
format_string_entries: Iterable[tuple],
encoded_message: bytes,
show_errors: bool = False,
recursive_detokenize: Callable[[str], str] | None = None,
)#

A detokenized string, with all results if there are collisions.

__init__(
token: int | None,
format_string_entries: Iterable[tuple],
encoded_message: bytes,
show_errors: bool = False,
recursive_detokenize: Callable[[str], str] | None = None,
)#

best_result() -> FormattedString | None#

Returns the string and args for the most likely decoded string.

error_message() -> str#

If detokenization failed, returns a descriptive message.

matches() -> list[pw_tokenizer.decode.FormattedString]#

Returns the strings that matched the token, best matches first.

ok() -> bool#

True if exactly one string decoded the arguments successfully.

class pw_tokenizer.detokenize.Detokenizer(*token_database_or_elf, show_errors: bool = False)#

Main detokenization class; detokenizes strings and caches results.

__init__(*token_database_or_elf, show_errors: bool = False)#

Decodes and detokenizes binary messages.

Parameters:
  • *token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf

  • show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode

detokenize(
encoded_message: bytes,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> DetokenizedString#

Decodes and detokenizes a message as a DetokenizedString.

detokenize_base64(data: AnyStr, prefix: str | bytes = b'$', recursion: int = 9) -> AnyStr#

Alias of detokenize_text for backwards compatibility.

detokenize_base64_live(
input_file: RawIOBase | BinaryIO,
output: BinaryIO,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> None#

Alias of detokenize_text_live for backwards compatibility.

detokenize_base64_to_file(
data: AnyStr,
output: BinaryIO,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> None#

Alias of detokenize_text_to_file for backwards compatibility.

detokenize_text(data: AnyStr, prefix: str | bytes = b'$', recursion: int = 9) -> AnyStr#

Decodes and replaces prefixed Base64 messages in the provided data.

Parameters:
  • data – the binary data to decode

  • prefix – one-character byte string that signals the start of a message

  • recursion – how many levels to recursively decode

Returns:

copy of the data with all recognized tokens decoded
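The idea of replacing prefixed Base64 messages in text can be sketched without the library. The example below is a simplified stand-in, not Detokenizer.detokenize_text: it assumes a plain dict mapping tokens to strings, handles only messages that consist of a 4-byte little-endian token with no arguments, accepts only the standard Base64 alphabet, and performs no recursion.

```python
import base64
import re
import struct

_B64 = r"[A-Za-z0-9+/]"


def detokenize_text_sketch(data, database, prefix="$"):
    """Replaces prefixed Base64 tokens found in data with database strings."""
    pattern = re.compile(
        re.escape(prefix)
        + rf"((?:{_B64}{{4}})*(?:{_B64}{{3}}=|{_B64}{{2}}==)?)")

    def replace(match):
        try:
            decoded = base64.b64decode(match.group(1), validate=True)
        except ValueError:
            return match.group(0)  # not valid Base64: leave the text alone
        if len(decoded) != 4:
            return match.group(0)  # this sketch ignores messages with arguments
        (token,) = struct.unpack("<I", decoded)
        # Unrecognized tokens are left in place.
        return database.get(token, match.group(0))

    return pattern.sub(replace, data)


database = {0x78563412: "Battery low"}
message = "$" + base64.b64encode(struct.pack("<I", 0x78563412)).decode()
print(detokenize_text_sketch(f"log: {message}!", database))
```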

detokenize_text_live(
input_file: RawIOBase | BinaryIO,
output: BinaryIO,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> None#

Reads chars one-at-a-time, decoding messages; SLOW for big files.

detokenize_text_to_file(
data: AnyStr,
output: BinaryIO,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> None#

Decodes prefixed Base64 messages in data; decodes to output file.

lookup(token: int) -> list[pw_tokenizer.detokenize._TokenizedFormatString]#

Returns (TokenizedStringEntry, FormatString) list for matches.

class pw_tokenizer.detokenize.NestedMessageParser(
prefix: str | bytes = b'$',
chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=',
)#

Parses nested tokenized messages from a byte stream or string.

__init__(
prefix: str | bytes = b'$',
chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=',
) -> None#

Initializes a parser.

Parameters:
  • prefix – one character that signifies the start of a message ($).

  • chars – characters allowed in a message

read_messages(chunk: bytes, *, flush: bool = False) -> Iterator[tuple[bool, bytes]]#

Reads prefixed messages from a byte string.

This function may be called repeatedly with chunks of a stream. Partial messages are preserved between calls, unless flush=True.

Parameters:
  • chunk – byte string that may contain nested messages

  • flush – whether to flush any incomplete messages after processing this chunk

Yields:

(is_message, contents) chunks.
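The chunked behavior described above, where a partial message survives between calls unless flush=True, can be sketched with a small state machine. ChunkedMessageParser is a hypothetical simplification, not the NestedMessageParser implementation; in particular, including the prefix in the yielded message contents is an assumption.

```python
class ChunkedMessageParser:
    """Splits a byte stream into (is_message, contents) pieces."""

    def __init__(
        self,
        prefix=b"$",
        chars=b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=",
    ):
        self._prefix = prefix[0]
        self._chars = frozenset(chars)
        self._buffer = bytearray()
        self._in_message = False

    def _take(self):
        contents = bytes(self._buffer)
        self._buffer.clear()
        return contents

    def read_messages(self, chunk, *, flush=False):
        for byte in chunk:
            if self._in_message:
                if byte in self._chars:
                    self._buffer.append(byte)
                    continue
                yield True, self._take()  # a disallowed byte ends the message
                self._in_message = False
            if byte == self._prefix:
                if self._buffer:
                    yield False, self._take()  # plain text before the prefix
                self._in_message = True
            self._buffer.append(byte)
        # Plain text is emitted immediately; a partial message waits for flush.
        if self._buffer and (flush or not self._in_message):
            yield self._in_message, self._take()
            self._in_message = False


parser = ChunkedMessageParser()
first = list(parser.read_messages(b"hi $AB"))
second = list(parser.read_messages(b"CD! ok", flush=True))
print(first, second)
```

The message that starts in the first chunk is only emitted once the second chunk supplies its end.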

read_messages_io(binary_io: RawIOBase | BinaryIO) -> Iterator[tuple[bool, bytes]]#

Reads prefixed messages from a byte stream (BinaryIO object).

Reads until EOF. If the stream is nonblocking (read(1) returns None), then this function returns and may be called again with the same IO object to continue parsing. Partial messages are preserved between calls.

Yields:

(is_message, contents) chunks.

transform(chunk: bytes, transform: Callable[[bytes], bytes], *, flush: bool = False) -> bytes#

Yields the chunk with a transformation applied to the messages.

Partial messages are preserved between calls unless flush=True.

transform_io(
binary_io: RawIOBase | BinaryIO,
transform: Callable[[bytes], bytes],
) -> Iterator[bytes]#

Yields the file with a transformation applied to the messages.

pw_tokenizer.detokenize.detokenize_base64(
detokenizer: Detokenizer,
data: bytes,
prefix: str | bytes = b'$',
recursion: int = 9,
) -> bytes#

Alias for detokenizer.detokenize_base64 for backwards compatibility.

This function is deprecated; do not call it.

Utilities for working with tokenized fields in protobufs.

pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: Detokenizer, data: bytes, prefix: str = '$') -> str#

Decodes data that may be plain text or binary / Base64 tokenized text.

pw_tokenizer.proto.detokenize_fields(
detokenizer: Detokenizer,
proto: Message,
prefix: str = '$',
) -> None#

Detokenizes fields annotated as tokenized in the given proto.

The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.