pw_tokenizer API reference#
pw_tokenizer: Compress strings to shrink logs by +75%
Configuration#
Defines
-
PW_TOKENIZER_CFG_ARG_TYPES_SIZE_BYTES#
For a tokenized string with arguments, the types of the arguments are encoded in either 4 bytes (uint32_t) or 8 bytes (uint64_t). 4 bytes supports up to 14 tokenized string arguments; 8 bytes supports up to 29 arguments. Using 8 bytes increases code size for 32-bit machines.
Argument types are encoded two bits per argument, in little-endian order. The 4 or 6 least-significant bits, respectively, store the number of arguments, while the remaining bits encode the argument types.
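The bit layout described above can be sketched in Python. This is an illustrative sketch only, not the library's implementation: the helper names (`max_supported_args`, `pack_arg_types`) are hypothetical, the numeric type codes are assumed values, and placing the first argument in the lowest type bits is also an assumption.

```python
# Hypothetical 2-bit type codes; the numeric values are assumptions.
ARG_INT, ARG_INT64, ARG_DOUBLE, ARG_STRING = 0, 1, 2, 3


def max_supported_args(size_bytes: int) -> int:
    """Number of 2-bit type slots left after the argument-count field."""
    count_bits = 4 if size_bytes == 4 else 6
    return (size_bytes * 8 - count_bits) // 2


def pack_arg_types(type_codes, size_bytes: int = 4) -> int:
    """Packs an argument count and 2-bit type codes into one integer."""
    if size_bytes not in (4, 8):
        raise ValueError("size_bytes must be 4 or 8")
    if len(type_codes) > max_supported_args(size_bytes):
        raise ValueError("too many arguments")
    count_bits = 4 if size_bytes == 4 else 6
    packed = len(type_codes)  # Count lives in the least-significant bits.
    for index, code in enumerate(type_codes):
        # Assumption: argument 0 occupies the lowest pair of type bits.
        packed |= (code & 0b11) << (count_bits + 2 * index)
    return packed
```

With these assumptions, `max_supported_args(4)` and `max_supported_args(8)` reproduce the 14 and 29 argument limits stated above.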
-
PW_TOKENIZER_CFG_C_HASH_LENGTH#
Maximum number of characters to hash in C. In C code, strings shorter than this length are treated as if they were zero-padded up to the length. Strings that are the same length and share a common prefix longer than this value hash to the same value. Increasing PW_TOKENIZER_CFG_C_HASH_LENGTH increases the compilation time for C due to the complexity of the hashing macros.
PW_TOKENIZER_CFG_C_HASH_LENGTH has no effect on C++ code. In C++, hashing is done with a constexpr function instead of a macro. There are no string length limitations and compilation times are unaffected by this macro.
Only hash lengths for which there is a corresponding macro header (pw_tokenizer/internal/pw_tokenizer_65599_fixed_length_#_hash_macro.) are supported. Additional macros may be generated with the generate_hash_macro.py function. New macro headers must then be added to pw_tokenizer/internal/tokenize_string.h.
This MUST match the value of DEFAULT_C_HASH_LENGTH in pw_tokenizer/py/pw_tokenizer/tokens.py.
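The effect of a fixed hash length can be illustrated with a small Python sketch. It assumes a 65599-style hash that is seeded with the full string length and adds each byte multiplied by successive powers of 65599, modulo 2^32. That structure, and the name `hash_65599`, are assumptions made for illustration and are not taken from this page.

```python
from typing import Optional

HASH_CONSTANT = 65599


def hash_65599(string: bytes, hash_length: Optional[int] = None) -> int:
    """65599-style hash seeded with the full string length (a sketch)."""
    hash_value = len(string)
    coefficient = HASH_CONSTANT
    # Only the first hash_length bytes contribute; None means all bytes.
    for byte in string[:hash_length]:
        hash_value = (hash_value + coefficient * byte) % 2**32
        coefficient = (coefficient * HASH_CONSTANT) % 2**32
    return hash_value
```

Under these assumptions, `hash_65599(b"abcdefgh-1", 8)` and `hash_65599(b"abcdefgh-2", 8)` collide, because the strings have the same length and the same first 8 bytes, while hashing the full strings gives different values. Zero padding has no effect because a zero byte adds nothing to the sum.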
-
PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES#
PW_TOKENIZER_CFG_ENCODING_BUFFER_SIZE_BYTES is deprecated. It is used as the default value for pw_log_tokenized’s PW_LOG_TOKENIZED_ENCODING_BUFFER_SIZE_BYTES. This value should not be configured; set PW_LOG_TOKENIZED_ENCODING_BUFFER_SIZE_BYTES instead.
-
PW_TOKENIZER_NESTED_PREFIX_STR#
Tokenization#
-
size_t pw::tokenizer::EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, span<std::byte> output)#
Encodes a tokenized string’s arguments to a buffer. The pw_tokenizer_ArgTypes parameter specifies the argument types, in place of a format string.
Most tokenization implementations should use the EncodedMessage class.
-
template<size_t kMaxSizeBytes>
class EncodedMessage# Encodes a tokenized message to a fixed size buffer. This class is used to encode tokenized messages passed in from tokenization macros.
To use pw::tokenizer::EncodedMessage, construct it with the token, argument types, and va_list from the variadic arguments:
void SendLogMessage(span<std::byte> log_data);

extern "C" void TokenizeToSendLogMessage(pw_tokenizer_Token token,
                                         pw_tokenizer_ArgTypes types,
                                         ...) {
  va_list args;
  va_start(args, types);
  EncodedMessage encoded_message(token, types, args);
  va_end(args);
  SendLogMessage(encoded_message);  // EncodedMessage converts to span
}
-
template<typename ...ArgTypes>
constexpr size_t pw::tokenizer::MinEncodingBufferSizeBytes()# Calculates the minimum buffer size to allocate that is guaranteed to support encoding the specified arguments.
The contents of strings are NOT included in this total. The string’s length/status byte is guaranteed to fit, but the string contents may be truncated. Encoding is considered to succeed as long as the string’s length/status byte is written, even if the actual string is truncated.
Examples:
Message with no arguments:
MinEncodingBufferSizeBytes() == 4
Message with an int argument
MinEncodingBufferSizeBytes<int>() == 9 (4 + 5)
-
template<typename T>
constexpr auto pw::tokenizer::EnumToToken(T value)# Tokenizes a given enumerator value. Used in the case of a tokenizing log backend.
- Parameters:
value – enumerator value
- Returns:
The 32-bit token (
pw_tokenizer_Token
)
-
template<typename T>
constexpr const char *pw::tokenizer::EnumToString(T value)# Returns a string representation of a given enumerator value name. Used in the case of a non-tokenizing log backend.
- Parameters:
value – enumerator value
- Returns:
constexpr char array
-
PW_TOKEN_FMT(...)#
Format specifier for a token argument.
-
PW_NESTED_TOKEN_FMT(...)#
Format specifier for a doubly-nested token argument. Doubly-nested token arguments are useful when the domain and/or token may not be known at the time of logging. For example, if an external function is required to return a domain and token for logging, PW_NESTED_TOKEN_FMT still allows for the value to be logged as it tokenizes the domain value as well. It can either take an argument of a domain value, or no argument at all if there is no specified domain.
PW_NESTED_TOKEN_FMT() expands to ${$#x}#%08x
PW_NESTED_TOKEN_FMT(domain_value) expands to ${${domain_value}#x}#%08x
An example of its application could look similar to this:
std::pair<PW_LOG_TOKEN_TYPE, PW_LOG_TOKEN_TYPE> GetDomainAndToken(...) {...}
const auto [domain, token] = GetDomainAndToken(...);
PW_LOG("Nested Token " PW_NESTED_TOKEN_FMT("enum_domain"), domain, token);
-
PW_TOKENIZE_ENUM(fully_qualified_name, ...)#
Tokenizes the given values within an enumerator. All values of the enumerator must be present to compile and have the enumerator be tokenized successfully. This macro should be in the same namespace as the enum declaration to use the
pw::tokenizer::EnumToString
function and avoid compilation errors.
-
PW_TOKENIZE_ENUM_CUSTOM(fully_qualified_name, ...)#
Tokenizes a custom string for each given value within an enumerator. All values of the enumerator must be followed by a custom string as a tuple (value, “string”). All values of the enumerator (and their associated custom string) must be present to compile and have the custom strings be tokenized successfully. This macro should be in the same namespace as the enum declaration to use the pw::tokenizer::EnumToString function and avoid compilation errors.
-
PW_TOKENIZE_FORMAT_STRING(domain, mask, format, ...)#
Tokenizes a format string with optional arguments and sets the _pw_tokenizer_token variable to the token. Must be used in its own scope, since the same variable is used in every invocation.
The tokenized string uses the specified tokenization domain. Use PW_TOKENIZER_DEFAULT_DOMAIN for the default. The token also may be masked; use UINT32_MAX to keep all bits.
This macro checks that the printf-style format string matches the arguments and that no more than PW_TOKENIZER_MAX_SUPPORTED_ARGS are provided. It then stores the format string in a special section, and calculates the string’s token at compile time.
-
PW_TOKENIZE_FORMAT_STRING_ANY_ARG_COUNT(domain, mask, format, ...)#
Equivalent to PW_TOKENIZE_FORMAT_STRING, but supports any number of arguments.
This is a low-level macro that should rarely be used directly. It is intended for situations when pw_tokenizer_ArgTypes is not used. There are two situations where pw_tokenizer_ArgTypes is unnecessary:
- The exact format string argument types and count are fixed.
- The format string supports a variable number of arguments of only one type. In this case, PW_FUNCTION_ARG_COUNT may be used to pass the argument count to the function.
-
PW_TOKENIZE_STRING(string_literal)#
Converts a string literal to a pw_tokenizer_Token (uint32_t) token in a standalone statement. C and C++ compatible. In C++, the string may be a literal or a constexpr char array, including function variables like __func__. In C, the argument must be a string literal. In either case, the string must be null terminated, but may contain any characters (including ‘\0’).
constexpr uint32_t token = PW_TOKENIZE_STRING("Any string literal!");
-
PW_TOKENIZE_STRING_DOMAIN(domain, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain. C and C++ compatible.
-
PW_TOKENIZE_STRING_DOMAIN_EXPR(domain, string_literal)#
Tokenizes a string literal using the specified domain within an expression. Requires C++.
-
PW_TOKENIZE_STRING_EXPR(string_literal)#
Converts a string literal to a uint32_t token within an expression. Requires C++.
DoSomething(PW_TOKENIZE_STRING_EXPR("Succeed"));
-
PW_TOKENIZE_STRING_MASK(domain, mask, string_literal)#
Tokenizes a string literal in a standalone statement using the specified domain and bit mask. C and C++ compatible.
-
PW_TOKENIZE_STRING_MASK_EXPR(domain, mask, string_literal)#
Tokenizes a string literal using the specified domain and bit mask within an expression. Requires C++.
-
PW_TOKENIZE_TO_BUFFER(buffer, buffer_size_pointer, format, ...)#
Encodes a tokenized string and arguments to the provided buffer. The size of the buffer is passed via a pointer to a size_t. After encoding is complete, the size_t is set to the number of bytes written to the buffer.
The macro’s arguments are equivalent to the following function signature:
TokenizeToBuffer(void* buffer,
                 size_t* buffer_size_pointer,
                 const char* format,
                 ...);  // printf-style arguments
For example, the following encodes a tokenized string with a temperature to a buffer. The buffer is passed to a function to send the message over a UART.
uint8_t buffer[32];
size_t size_bytes = sizeof(buffer);
PW_TOKENIZE_TO_BUFFER(
    buffer, &size_bytes, "Temperature (C): %0.2f", temperature_c);
MyProject_EnqueueMessageForUart(buffer, size_bytes);
While
PW_TOKENIZE_TO_BUFFER
is very flexible, it must be passed a buffer, which increases its code size footprint at the call site.
-
PW_TOKENIZE_TO_BUFFER_DOMAIN(domain, buffer, buffer_size_pointer, format, ...)#
Same as PW_TOKENIZE_TO_BUFFER, but tokenizes to the specified domain.
-
PW_TOKENIZE_TO_BUFFER_MASK(domain, mask, buffer, buffer_size_pointer, format, ...)#
Same as
PW_TOKENIZE_TO_BUFFER_DOMAIN
, but applies a bit mask to the token.
-
PW_TOKENIZER_REPLACE_FORMAT_STRING(...)#
Low-level macro for calling functions that handle tokenized strings.
Functions that work with tokenized format strings must take the following arguments:
- The 32-bit token (pw_tokenizer_Token)
- The 32- or 64-bit argument types (pw_tokenizer_ArgTypes)
- Variadic arguments, if any
This macro expands to those arguments. Custom tokenization macros should use this macro to pass these arguments to a function or other macro.
EncodeMyTokenizedString(uint32_t token, pw_tokenizer_ArgTypes arg_types, ...);

#define CUSTOM_TOKENIZATION_MACRO(format, ...)                  \
  PW_TOKENIZE_FORMAT_STRING(domain, mask, format, __VA_ARGS__); \
  EncodeMyTokenizedString(PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__))
-
PW_TOKENIZER_ARG_TYPES(...)#
Converts a series of arguments to a compact format that replaces the format string literal. Evaluates to a pw_tokenizer_ArgTypes value.
Depending on the size of pw_tokenizer_ArgTypes, the bottom 4 or 6 bits store the number of arguments and the remaining bits store the types, two bits per type. The arguments are not evaluated; only their types are used.
In general, PW_TOKENIZER_ARG_TYPES should not be used directly. Instead, use PW_TOKENIZER_REPLACE_FORMAT_STRING.
-
PW_TOKENIZER_DEFINE_TOKEN(token, domain, string)#
Records the original token, domain and string directly.
This macro is intended to be used for tokenized enum and domain support. The values are stored as an entry in the ELF section. As a note for tokenized enum support, the enum name should be used as the string, and the enum value as the token.
-
size_t pw_tokenizer_EncodeArgs(pw_tokenizer_ArgTypes types, va_list args, void *output_buffer, size_t output_buffer_size)#
C function that encodes arguments to a tokenized buffer. Use the
pw::tokenizer::EncodeArgs()
function from C++.
-
static inline size_t pw_tokenizer_EncodeInt(int value, void *output, size_t output_size_bytes)#
Encodes an
int
with the standard integer encoding: zig-zag + LEB128. This function is only necessary when manually encoding tokenized messages.
-
static inline size_t pw_tokenizer_EncodeInt64(int64_t value, void *output, size_t output_size_bytes)#
Encodes an
int64_t
with the standard integer encoding: zig-zag + LEB128. This function is only necessary when manually encoding tokenized messages.
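For readers encoding messages by hand, the zig-zag + LEB128 scheme named above can be sketched in Python. This is a generic illustration of the two standard techniques, not the library's code; the function names are hypothetical.

```python
def zigzag(value: int, bits: int = 64) -> int:
    """Maps signed integers to unsigned so small magnitudes stay small."""
    return ((value << 1) ^ (value >> (bits - 1))) & ((1 << bits) - 1)


def encode_varint(value: int) -> bytes:
    """LEB128: 7 payload bits per byte, high bit set on all but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # More bytes follow.
        else:
            out.append(byte)  # Final byte has the high bit clear.
            return bytes(out)


def encode_int(value: int, bits: int = 32) -> bytes:
    """Zig-zag encodes a signed integer, then writes it as a varint."""
    return encode_varint(zigzag(value, bits))
```

In this sketch a 32-bit value never needs more than 5 bytes and a 64-bit value never more than 10, which appears consistent with the 4 + 5 figure given for MinEncodingBufferSizeBytes<int>() earlier on this page.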
-
typedef uint32_t pw_tokenizer_Token#
The type of the 32-bit token used in place of a string. Also available as
pw::tokenizer::Token
.
- pw_tokenizer.encode.encode_token_and_args(token: int, *args: int | float | bytes | str) bytes #
Encodes a tokenized message given its token and arguments.
This function assumes that the token represents a format string with conversion specifiers that correspond with the provided argument types. Currently, only 32-bit integers are supported.
- pw_tokenizer.tokens.pw_tokenizer_65599_hash(string: str | bytes, *, hash_length: int | None = None) int #
Hashes the string with the hash function used to generate tokens in C++.
This hash function is used to calculate tokens from strings in Python. It is not used when extracting tokens from an ELF, since the token is stored in the ELF as part of tokenization.
See Crate pw_tokenizer.
Token databases#
-
class TokenDatabase#
Reads entries from a v0 binary token string database. This class does not copy or modify the contents of the database.
The v0 token database has two significant shortcomings:
- Strings cannot contain null terminators (\0). If a string contains a \0, the database will not work correctly.
- The domain is not included in entries. All tokens belong to a single domain, which must be known independently.
A v0 binary token database is comprised of a 16-byte header followed by an array of 8-byte entries and a table of null-terminated strings. The header specifies the number of entries. Each entry contains information about a tokenized string: the token and removal date, if any. All fields are little-endian.
The token removal date is stored within an unsigned 32-bit integer. It is stored as <day> <month> <year>, where <day> and <month> are 1 byte each and <year> is two bytes. The fields are set to their maximum value (0xFF or 0xFFFF) if they are unset. With this format, dates may be compared naturally as unsigned integers.
Header (16 bytes)
Offset  Size  Field
0       6     Magic number (TOKENS)
6       2     Version (00 00)
8       4     Entry count
12      4     Reserved
Entry (8 bytes)
Offset  Size  Field
0       4     Token
4       1     Removal day (1-31, 255 if unset)
5       1     Removal month (1-12, 255 if unset)
6       2     Removal year (65535 if unset)
Entries are sorted by token. A string table with a null-terminated string for each entry in order follows the entries.
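The layout above maps directly onto Python's struct module. The following is a minimal sketch for building and reading such a blob, written only from the tables on this page; `build_database` and `parse_database` are hypothetical helpers and are not part of pw_tokenizer.

```python
import struct

HEADER = struct.Struct("<6sHI4x")  # Magic, version, entry count, reserved.
ENTRY = struct.Struct("<IBBH")     # Token, removal day, month, year.


def build_database(entries):
    """Serializes (token, (day, month, year) or None, string) tuples."""
    entries = sorted(entries, key=lambda e: e[0])  # Sorted by token.
    blob = bytearray(HEADER.pack(b"TOKENS", 0, len(entries)))
    for token, removed, _ in entries:
        day, month, year = removed or (0xFF, 0xFF, 0xFFFF)  # Unset date.
        blob += ENTRY.pack(token, day, month, year)
    for _, _, string in entries:
        blob += string.encode() + b"\0"  # String table, in entry order.
    return bytes(blob)


def parse_database(data):
    """Returns a list of (token, date_removed as uint32, string) tuples."""
    magic, version, count = HEADER.unpack_from(data, 0)
    if magic != b"TOKENS" or version != 0:
        raise ValueError("not a v0 token database")
    table_start = HEADER.size + count * ENTRY.size
    strings = data[table_start:].split(b"\0")
    result = []
    for i in range(count):
        offset = HEADER.size + i * ENTRY.size
        token, day, month, year = ENTRY.unpack_from(data, offset)
        # Same value as reading the 4 date bytes as one little-endian uint32.
        date_removed = day | month << 8 | year << 16
        result.append((token, date_removed, strings[i].decode()))
    return result
```

Because the year occupies the most significant bytes of the packed integer, comparing two `date_removed` values as plain integers orders them chronologically, and an unset date (0xFFFFFFFF) sorts after every real date.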
Entries are accessed by iterating over the database. An O(n) Find function is also provided. In typical use, a TokenDatabase is preprocessed by a pw::tokenizer::Detokenizer into a std::unordered_map.
Public Functions
-
inline constexpr size_type size() const#
Returns the total number of entries (unique token-string pairs).
-
inline constexpr bool ok() const#
True if this database was constructed with valid data. The database might be empty, but it has an intact header and a string for each entry.
Public Static Functions
-
template<typename ByteArray>
static inline constexpr bool IsValid(const ByteArray &bytes)# Returns true if the provided data is a valid token database. This checks the magic number (TOKENS), version (which must be 0), and that there is one string for each entry in the database. A database with extra strings or other trailing data is considered valid.
-
template<const auto &kDatabaseBytes>
static inline constexpr TokenDatabase Create()# Creates a TokenDatabase and checks if the provided data is valid at compile time. Accepts references to constexpr containers (array, span, string_view, etc.) with static storage duration. For example:
constexpr char kMyData[] = ...;
constexpr TokenDatabase db = TokenDatabase::Create<kMyData>();
-
template<typename ByteArray>
static inline constexpr TokenDatabase Create(const ByteArray &database_bytes)# Creates a TokenDatabase from the provided byte array. The array may be a span, array, or other container type. If the data is not valid, returns a default-constructed database for which ok() is false.
Prefer the Create overload that takes the data as a template parameter when possible, since that overload verifies data integrity at compile time.
Public Static Attributes
-
static constexpr uint32_t kDateRemovedNever = 0xFFFFFFFF#
Default date_removed for an entry in the token database if it was never removed.
-
class Entries#
A list of token entries returned from a
Find
operation. This object can be iterated over or indexed as an array.
-
struct Entry#
An entry in the token database.
Public Members
-
uint32_t token#
The token that represents this string.
-
uint32_t date_removed#
The date the token and string were removed from the database, or kDateRemovedNever if they were never removed. Dates are encoded such that natural integer sorting sorts from oldest to newest dates. The date is stored as an 8-bit day, 8-bit month, and 16-bit year, packed into a little-endian uint32_t.
-
const char *string#
The null-terminated string represented by this token.
-
class iterator#
Iterator for
TokenDatabase
values.
Detokenization#
-
using TokenizedStringEntry = std::pair<FormatString, uint32_t>#
Token database entry.
-
class DetokenizedString#
A string that has been detokenized. This class tracks all possible results if there are token collisions.
Public Functions
-
inline bool ok() const#
True if there was only one valid match and it decoded successfully.
-
inline const std::vector<DecodedFormatString> &matches() const#
Returns the strings that matched the token, with the best matches first.
-
std::string BestString() const#
Returns the detokenized string or an empty string if there were no matches. If there are multiple possible results, the
DetokenizedString
returns the first match.
-
std::string BestStringWithErrors() const#
Returns the best match, with error messages inserted for arguments that failed to parse.
-
class Detokenizer#
Decodes and detokenizes from a token database. This class builds a hash table of tokens to give O(1) token lookups.
Public Functions
-
explicit Detokenizer(const TokenDatabase &database)#
Constructs a detokenizer from a TokenDatabase. The TokenDatabase is not referenced by the Detokenizer after construction; its memory can be freed.
-
inline explicit Detokenizer(std::unordered_map<uint32_t, std::vector<TokenizedStringEntry>> &&database)#
Constructs a detokenizer by directly passing the parsed database.
-
DetokenizedString Detokenize(const span<const std::byte> &encoded) const#
Decodes and detokenizes the binary encoded message. Returns a
DetokenizedString
that stores all possible detokenized string results.
-
inline DetokenizedString Detokenize(const span<const uint8_t> &encoded) const#
Overload of Detokenize for span<const uint8_t>.
-
inline DetokenizedString Detokenize(std::string_view encoded) const#
Overload of Detokenize for std::string_view.
-
inline DetokenizedString Detokenize(const void *encoded, size_t size_bytes) const#
Overload of
Detokenize
for a pointer and length.
-
DetokenizedString DetokenizeBase64Message(std::string_view text) const#
Decodes and detokenizes a Base64-encoded message. Returns a
DetokenizedString
that stores all possible detokenized string results.
-
std::string DetokenizeText(std::string_view text, unsigned max_passes = 3) const#
Decodes and detokenizes nested tokenized messages in a string.
This function currently only supports Base64 nested tokenized messages. Support for hexadecimal-encoded string literals will be added.
- Parameters:
text – [in] Text potentially containing tokenized messages.
max_passes – [in] DetokenizeText supports recursive detokenization. Tokens can expand to other tokens. The maximum number of detokenization passes is specified by max_passes (0 is equivalent to 1).
- Returns:
The original string with nested tokenized messages decoded in context. Messages that fail to decode are left as-is.
-
inline std::string DetokenizeBase64(std::string_view text) const#
Deprecated version of DetokenizeText with no recursive detokenization.
- Deprecated:
Call DetokenizeText instead.
-
std::string DecodeOptionallyTokenizedData(const span<const std::byte> &optionally_tokenized_data)#
Decodes data that may or may not be tokenized, such as proto fields marked as optionally tokenized.
This function currently only supports Base64 nested tokenized messages. Support for hexadecimal-encoded string literals will be added.
This function currently assumes when data is not tokenized it is printable ASCII. Otherwise, the returned string will be base64-encoded.
- Parameters:
optionally_tokenized_data – [in] Data optionally tokenized.
- Returns:
The decoded text if successfully detokenized or if the data is printable, otherwise returns the data base64-encoded.
Public Static Functions
-
static Result<Detokenizer> FromElfSection(span<const std::byte> elf_section)#
Constructs a detokenizer from the
.pw_tokenizer.entries
section of an ELF binary.
-
static inline Result<Detokenizer> FromElfSection(span<const uint8_t> elf_section)#
Overload of FromElfSection for a uint8_t span.
-
static Result<Detokenizer> FromElfFile(stream::SeekableReader &stream)#
Constructs a detokenizer from the
.pw_tokenizer.entries
section of an ELF binary.
Decodes and detokenizes strings from binary or Base64 input.
The main class provided by this module is the Detokenizer class. To use it, construct it with the path to an ELF or CSV database, a tokens.Database, or a file object for an ELF file or CSV. Then, call the detokenize method with encoded messages, one at a time. The detokenize method returns a DetokenizedString object with the result.
For example:
from pw_tokenizer import detokenize
detok = detokenize.Detokenizer('path/to/firmware/image.elf')
print(detok.detokenize(b'\x12\x34\x56\x78\x03hi!'))
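To make the byte string in the example concrete, the sketch below splits it by hand. It assumes the message starts with a 4-byte little-endian token, and that the matching format string takes a single string argument encoded as one length/status byte (low 7 bits for the length, high bit as a truncation flag) followed by the characters. The function name is hypothetical and the layout is an assumption for illustration.

```python
def decode_token_and_string(message: bytes):
    """Splits a message into (token, string argument, truncated flag)."""
    token = int.from_bytes(message[:4], "little")
    status = message[4]
    length = status & 0x7F           # Assumed: low 7 bits hold the length.
    truncated = bool(status & 0x80)  # Assumed: high bit marks truncation.
    text = message[5:5 + length].decode("utf-8", errors="replace")
    return token, text, truncated


token, text, truncated = decode_token_and_string(b'\x12\x34\x56\x78\x03hi!')
print(hex(token), text, truncated)
```

A real detokenizer would then look up the token in the database to find the format string and substitute the decoded argument into it.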
This module also provides a command line interface for decoding and detokenizing messages from a file or stdin.
- class pw_tokenizer.detokenize.AutoUpdatingDetokenizer(
- *paths_or_files: ~pathlib.Path | str,
- min_poll_period_s: float = 1.0,
- pool: ~concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>,
- prefix: str | bytes = '$',
Loads and updates a detokenizer from database paths.
- __init__(
- *paths_or_files: ~pathlib.Path | str,
- min_poll_period_s: float = 1.0,
- pool: ~concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>,
- prefix: str | bytes = '$',
Decodes and detokenizes binary messages.
- Parameters:
*token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf
prefix – one-character byte string that signals the start of a message
show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode
- lookup(token: int) list[_TokenizedFormatString] #
Returns (TokenizedStringEntry, FormatString) list for matches.
- class pw_tokenizer.detokenize.DetokenizedString(
- token: int | None,
- format_string_entries: Iterable[tuple],
- encoded_message: bytes,
- show_errors: bool = False,
- recursive_detokenize: Callable[[str], str] | None = None,
A detokenized string, with all results if there are collisions.
- __init__(
- token: int | None,
- format_string_entries: Iterable[tuple],
- encoded_message: bytes,
- show_errors: bool = False,
- recursive_detokenize: Callable[[str], str] | None = None,
- best_result() FormattedString | None #
Returns the string and args for the most likely decoded string.
- error_message() str #
If detokenization failed, returns a descriptive message.
- matches() list[FormattedString] #
Returns the strings that matched the token, best matches first.
- ok() bool #
True if exactly one string decoded the arguments successfully.
- class pw_tokenizer.detokenize.Detokenizer(*token_database_or_elf, show_errors: bool = False, prefix: str | bytes = '$')#
Main detokenization class; detokenizes strings and caches results.
- __init__(*token_database_or_elf, show_errors: bool = False, prefix: str | bytes = '$')#
Decodes and detokenizes binary messages.
- Parameters:
*token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf
prefix – one-character byte string that signals the start of a message
show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode
- detokenize(encoded_message: bytes, recursion: int = 9) DetokenizedString #
Decodes and detokenizes a message as a DetokenizedString.
- detokenize_base64(data: AnyStr, recursion: int = 9) AnyStr #
Alias of detokenize_text for backwards compatibility.
- detokenize_base64_live(
- input_file: RawIOBase | BinaryIO,
- output: BinaryIO,
- recursion: int = 9,
Alias of detokenize_text_live for backwards compatibility.
- detokenize_base64_to_file(data: AnyStr, output: BinaryIO, recursion: int = 9) None #
Alias of detokenize_text_to_file for backwards compatibility.
- detokenize_text(data: AnyStr, recursion: int = 9) AnyStr #
Decodes and replaces prefixed Base64 messages in the provided data.
- Parameters:
data – the binary data to decode
recursion – how many levels to recursively decode
- Returns:
copy of the data with all recognized tokens decoded
- detokenize_text_live(
- input_file: RawIOBase | BinaryIO,
- output: BinaryIO,
- recursion: int = 9,
Reads chars one-at-a-time, decoding messages; SLOW for big files.
- detokenize_text_to_file(data: AnyStr, output: BinaryIO, recursion: int = 9) None #
Decodes prefixed Base64 messages in data; decodes to output file.
- lookup(token: int) list[_TokenizedFormatString] #
Returns (TokenizedStringEntry, FormatString) list for matches.
- class pw_tokenizer.detokenize.NestedMessageParser(
- prefix: str | bytes = '$',
- chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=',
Parses nested tokenized messages from a byte stream or string.
- __init__(
- prefix: str | bytes = '$',
- chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=',
Initializes a parser.
- Parameters:
prefix – one character that signifies the start of a message ($).
chars – characters allowed in a message
- read_messages(chunk: bytes, *, flush: bool = False) Iterator[tuple[bool, bytes]] #
Reads prefixed messages from a byte string.
This function may be called repeatedly with chunks of a stream. Partial messages are preserved between calls, unless flush=True.
- Parameters:
chunk – byte string that may contain nested messages
flush – whether to flush any incomplete messages after processing this chunk
- Yields:
(is_message, contents)
chunks.
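The shape of the (is_message, contents) chunks can be illustrated with a small standalone sketch that uses a regular expression. This is not the NestedMessageParser implementation: it handles only a complete byte string (no partial messages carried between calls), the name `split_messages` is hypothetical, and keeping the prefix inside the message chunk is a choice made for this sketch.

```python
import re

# Same character set as the default `chars` argument shown above.
MESSAGE_CHARS = (
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_="
)


def split_messages(data: bytes, prefix: bytes = b"$"):
    """Yields (is_message, contents) for a complete byte string."""
    pattern = re.compile(
        re.escape(prefix) + b"[" + re.escape(MESSAGE_CHARS) + b"]*"
    )
    position = 0
    for match in pattern.finditer(data):
        if match.start() > position:
            # Plain text between messages passes through unchanged.
            yield False, data[position:match.start()]
        yield True, match.group()
        position = match.end()
    if position < len(data):
        yield False, data[position:]
```

For example, `list(split_messages(b"boot $AAAAAA== ok"))` yields the plain text before the message, the prefixed message itself, and the trailing plain text as three separate chunks.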
- read_messages_io(binary_io: RawIOBase | BinaryIO) Iterator[tuple[bool, bytes]] #
Reads prefixed messages from a byte stream (BinaryIO object).
Reads until EOF. If the stream is nonblocking (read(1) returns None), then this function returns and may be called again with the same IO object to continue parsing. Partial messages are preserved between calls.
- Yields:
(is_message, contents)
chunks.
- transform(chunk: bytes, transform: Callable[[bytes], bytes], *, flush: bool = False) bytes #
Yields the chunk with a transformation applied to the messages.
Partial messages are preserved between calls unless
flush=True
.
- transform_io(
- binary_io: RawIOBase | BinaryIO,
- transform: Callable[[bytes], bytes],
Yields the file with a transformation applied to the messages.
- pw_tokenizer.detokenize.detokenize_base64(detokenizer: Detokenizer, data: bytes, recursion: int = 9) bytes #
Alias for detokenizer.detokenize_base64 for backwards compatibility.
This function is deprecated; do not call it.
Utilities for working with tokenized fields in protobufs.
- pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: Detokenizer | None, data: bytes) str #
Decodes data that may be plain text or binary / Base64 tokenized text.
- Parameters:
detokenizer – detokenizer to use; if None, binary logs as Base64 encoded
data – encoded text or binary data
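The fallback behavior described for optionally tokenized data can be sketched as follows. This is a simplified illustration and not the pw_tokenizer.proto implementation: `lookup` is a hypothetical callable standing in for a detokenizer, and the exact form of the library's Base64 output (for instance, whether a prefix is added) is not modeled.

```python
import base64
import string

PRINTABLE = set(string.printable.encode())


def decode_optionally_tokenized_sketch(data: bytes, lookup=None) -> str:
    """Returns detokenized text, plain text, or Base64, in that order.

    lookup: hypothetical callable returning a detokenized str or None.
    """
    if lookup is not None:
        text = lookup(data)
        if text is not None:
            return text
    if all(byte in PRINTABLE for byte in data):
        return data.decode("ascii")  # Data was already plain text.
    return base64.b64encode(data).decode("ascii")  # Binary fallback.
```

With no detokenizer, printable input such as `b"hello"` comes back unchanged, while binary input such as `b"\x12\x34\x56\x78"` is returned Base64-encoded so that it can still be logged and decoded later.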
- pw_tokenizer.proto.detokenize_fields(detokenizer: Detokenizer | None, proto: Message) None #
Detokenizes fields annotated as tokenized in the given proto.
The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.