API reference

pw_tokenizer: Compress strings to shrink logs by +75%

C/C++

Moved: pw_tokenizer

Rust

See Crate pw_tokenizer.

Python

Tokenization

pw_tokenizer.encode.encode_token_and_args(token: int, *args: int | float | bytes | str) → bytes#

Encodes a tokenized message given its token and arguments.

This function assumes that the token represents a format string with conversion specifiers that correspond with the provided argument types. Currently, only 32-bit integers are supported.
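As a rough illustration of the kind of output this function produces, the sketch below encodes a token and integer arguments without importing pw_tokenizer. It assumes the layout described in the pw_tokenizer encoding documentation: a 4-byte little-endian token followed by ZigZag-encoded varints. The names `sketch_encode`, `_zigzag`, and `_varint` are hypothetical and are not part of the library.

```python
import struct


def _zigzag(value: int) -> int:
    """Maps signed integers to unsigned ones so small magnitudes stay small."""
    return (value << 1) if value >= 0 else ((-value) << 1) - 1


def _varint(value: int) -> bytes:
    """Encodes an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def sketch_encode(token: int, *args: int) -> bytes:
    """Hypothetical stand-in for encode_token_and_args, integers only."""
    encoded = struct.pack('<I', token)  # assumed: little-endian 32-bit token
    for arg in args:
        encoded += _varint(_zigzag(arg))
    return encoded
```

For real messages, prefer `pw_tokenizer.encode.encode_token_and_args` itself; the sketch only shows why small integer arguments tend to occupy a single byte.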

pw_tokenizer.tokens.pw_tokenizer_65599_hash(string: str | bytes, *, hash_length: int | None = None) → int#

Hashes the string with the hash function used to generate tokens in C++.

This hash function is used to calculate tokens from strings in Python. It is not used when extracting tokens from an ELF, since the token is stored in the ELF as part of tokenization.
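The following is a minimal sketch of a 65599-based polynomial hash, included to show roughly what such a function computes. The algorithm shown (start from the string length, then add each byte times a growing power of 65599, modulo 2**32) is an assumption based on the function name and has not been checked against the library here. `sketch_65599_hash` is a hypothetical name.

```python
from __future__ import annotations

HASH_CONSTANT = 65599  # assumed from the function name


def sketch_65599_hash(string: str | bytes, hash_length: int | None = None) -> int:
    """Hashes up to hash_length bytes of the string into a 32-bit value."""
    data = string.encode() if isinstance(string, str) else string
    hash_value = len(data)  # assumed: the full length seeds the hash
    coefficient = HASH_CONSTANT
    for byte in data[:hash_length]:
        hash_value = (hash_value + coefficient * byte) % 2**32
        coefficient = (coefficient * HASH_CONSTANT) % 2**32
    return hash_value
```

When tokens must match those generated in C++, call `pw_tokenizer.tokens.pw_tokenizer_65599_hash` rather than a reimplementation like this one.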

Detokenization

Decodes and detokenizes strings from binary or Base64 input.

The main class provided by this module is the Detokenizer class. To use it, construct it with the path to an ELF or CSV database, a tokens.Database, or a file object for an ELF file or CSV. Then, call the detokenize method with encoded messages, one at a time. The detokenize method returns a DetokenizedString object with the result.

For example:

from pw_tokenizer import detokenize

detok = detokenize.Detokenizer('path/to/firmware/image.elf')
print(detok.detokenize(b'\x12\x34\x56\x78\x03hi!'))

This module also provides a command line interface for decoding and detokenizing messages from a file or stdin.
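The encoded message passed to detokenize in the example above can be read as a 4-byte token followed by one string argument. The sketch below takes it apart by hand, assuming the token is little-endian and the string argument is prefixed with a one-byte length whose low 7 bits hold the size. `split_example_message` is a hypothetical helper, not part of pw_tokenizer.

```python
import struct


def split_example_message(encoded: bytes) -> tuple[int, bytes]:
    """Splits a message holding a token and a single string argument."""
    (token,) = struct.unpack('<I', encoded[:4])
    length = encoded[4] & 0x7F  # assumed: low 7 bits hold the string length
    return token, encoded[5:5 + length]


token, arg = split_example_message(b'\x12\x34\x56\x78\x03hi!')
print(hex(token), arg)  # prints: 0x78563412 b'hi!'
```

A real detokenizer also needs the format string for the token to know which argument types to expect; this sketch hard-codes a single string argument.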

class pw_tokenizer.detokenize.AutoUpdatingDetokenizer( *paths_or_files: ~pathlib.Path | str, min_poll_period_s: float = 1.0, pool: ~concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>, prefix: str | bytes = '$', )#

Loads and updates a detokenizer from database paths.

__init__( *paths_or_files: ~pathlib.Path | str, min_poll_period_s: float = 1.0, pool: ~concurrent.futures._base.Executor = <concurrent.futures.thread.ThreadPoolExecutor object>, prefix: str | bytes = '$', ) → None#

Decodes and detokenizes binary messages.

Parameters:

*token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf
prefix – one-character byte string that signals the start of a message
show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode

lookup(token: int) → list[_TokenizedFormatString]#: Returns (TokenizedStringEntry, FormatString) list for matches.

class pw_tokenizer.detokenize.DetokenizedString( token: int | None, format_string_entries: Iterable[tuple], encoded_message: bytes, show_errors: bool = False, recursive_detokenize: Callable[[str], str] | None = None, )#

A detokenized string, with all results if there are collisions.

__init__( token: int | None, format_string_entries: Iterable[tuple], encoded_message: bytes, show_errors: bool = False, recursive_detokenize: Callable[[str], str] | None = None, )#

best_result() → FormattedString | None#: Returns the string and args for the most likely decoded string.

error_message() → str#: If detokenization failed, returns a descriptive message.

matches() → list[FormattedString]#: Returns the strings that matched the token, best matches first.

ok() → bool#: True if exactly one string decoded the arguments successfully.
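One way to act on these methods is sketched below. The helper accepts any object with the ok, matches, and error_message methods listed above, so it runs without importing pw_tokenizer. It assumes that str(result) yields the detokenized text, as the print call in the module example suggests. `summarize` is a hypothetical name.

```python
def summarize(result) -> str:
    """Returns one line of text for a DetokenizedString-like object."""
    if result.ok():
        # Exactly one format string decoded the arguments successfully.
        return str(result)
    candidates = result.matches()
    if candidates:
        # Several strings share the token; the best match is listed first.
        return f'ambiguous ({len(candidates)} candidates): {result}'
    return f'failed: {result.error_message()}'
```

Typical use would be `summarize(detok.detokenize(encoded_message))`, where `detok` is a Detokenizer.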

class pw_tokenizer.detokenize.Detokenizer(*token_database_or_elf, show_errors: bool = False, prefix: str | bytes = '$')#

Main detokenization class; detokenizes strings and caches results.

__init__(*token_database_or_elf, show_errors: bool = False, prefix: str | bytes = '$')#

Decodes and detokenizes binary messages.

Parameters:

*token_database_or_elf – a path or file object for an ELF or CSV database, a tokens.Database, or an elf_reader.Elf
prefix – one-character byte string that signals the start of a message
show_errors – if True, an error message is used in place of the % conversion specifier when an argument fails to decode

detokenize( encoded_message: bytes, domain: str | None = None, recursion: int = 5, ) → DetokenizedString#: Decodes and detokenizes a message as a DetokenizedString.

detokenize_text(data: AnyStr, recursion: int = 5) → AnyStr#

Decodes and replaces prefixed Base64 messages in the provided data.

Parameters:

data – the binary data to decode
recursion – how many levels to recursively decode

Returns:

copy of the data with all recognized tokens decoded
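To make "prefixed Base64 message" concrete, the sketch below converts a binary tokenized message to and from the text form that detokenize_text looks for. It assumes the text form is the prefix character followed by standard Base64 of the binary message. Both helper names are hypothetical.

```python
import base64


def to_prefixed_base64(binary_message: bytes, prefix: str = '$') -> str:
    """Returns the text form of a binary tokenized message."""
    return prefix + base64.b64encode(binary_message).decode('ascii')


def from_prefixed_base64(text: str, prefix: str = '$') -> bytes:
    """Recovers the binary message from its prefixed Base64 form."""
    if not text.startswith(prefix):
        raise ValueError('missing prefix')
    return base64.b64decode(text[len(prefix):])


text = to_prefixed_base64(b'\x12\x34\x56\x78\x03hi!')
```

A log line containing `text` is the kind of input detokenize_text accepts; the recursion parameter covers detokenized strings that themselves contain further prefixed messages.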

detokenize_text_live( input_file: RawIOBase | BinaryIO, output: BinaryIO, recursion: int = 5, ) → None#: Reads chars one-at-a-time, decoding messages; SLOW for big files.

detokenize_text_to_file(data: AnyStr, output: BinaryIO, recursion: int = 5) → None#: Decodes prefixed Base64 messages in data; decodes to output file.

lookup(token: int) → list[_TokenizedFormatString]#: Returns (TokenizedStringEntry, FormatString) list for matches.

class pw_tokenizer.detokenize.NestedMessageParser( prefix: str | bytes = '$', chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=', )#

Parses nested tokenized messages from a byte stream or string.

__init__( prefix: str | bytes = '$', chars: str | bytes = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+/-_=', ) → None#

Initializes a parser.

Parameters:

prefix – one character that signifies the start of a message ($).
chars – characters allowed in a message

read_messages(chunk: bytes, *, flush: bool = False) → Iterator[tuple[bool, bytes]]#

Reads prefixed messages from a byte string.

This function may be called repeatedly with chunks of a stream. Partial messages are preserved between calls, unless flush=True.

Parameters:

chunk – byte string that may contain nested messages
flush – whether to flush any incomplete messages after processing this chunk

Yields:

(is_message, contents) chunks.
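The chunked calling pattern described above can be sketched as follows. The helper takes any object with a read_messages method of the documented shape, so it does not import pw_tokenizer. `collect_messages` is a hypothetical name, and the final empty-chunk flush is one possible way to drain buffered data, not something the library prescribes.

```python
from typing import Iterable


def collect_messages(parser, chunks: Iterable[bytes]) -> list[bytes]:
    """Returns the prefixed messages found in a sequence of stream chunks."""
    found = []
    for chunk in chunks:
        # Partial messages are kept by the parser between these calls.
        for is_message, contents in parser.read_messages(chunk):
            if is_message:
                found.append(contents)
    # Flush with an empty chunk so anything still buffered is yielded.
    for is_message, contents in parser.read_messages(b'', flush=True):
        if is_message:
            found.append(contents)
    return found
```

With the real class this might be called as `collect_messages(NestedMessageParser(), chunks)`, where `chunks` comes from a socket or serial port.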

read_messages_io(binary_io: RawIOBase | BinaryIO) → Iterator[tuple[bool, bytes]]#

Reads prefixed messages from a byte stream (BinaryIO object).

Reads until EOF. If the stream is nonblocking (read(1) returns None), then this function returns and may be called again with the same IO object to continue parsing. Partial messages are preserved between calls.

Yields:

(is_message, contents) chunks.

transform(chunk: bytes, transform: Callable[[bytes], bytes], *, flush: bool = False) → bytes#

Yields the chunk with a transformation applied to the messages.

Partial messages are preserved between calls unless flush=True.

transform_io( binary_io: RawIOBase | BinaryIO, transform: Callable[[bytes], bytes], ) → Iterator[bytes]#: Yields the file with a transformation applied to the messages.

Utilities for working with tokenized fields in protobufs.

pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: Detokenizer | None, data: bytes) → str#

Decodes data that may be plain text or binary / Base64 tokenized text.

Parameters:

detokenizer – detokenizer to use; if None, binary data is returned Base64-encoded
data – encoded text or binary data
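The sketch below shows one plausible reading of the case where no detokenizer is supplied: printable text passes through unchanged, and anything else is returned as prefixed Base64. This is an assumption about the behavior, not a copy of the library's implementation, and `sketch_decode_without_detokenizer` is a hypothetical name.

```python
import base64


def sketch_decode_without_detokenizer(data: bytes, prefix: str = '$') -> str:
    """Returns printable text as-is, otherwise prefixed Base64."""
    try:
        text = data.decode()
        if text.isprintable():
            return text
    except UnicodeDecodeError:
        pass  # not valid UTF-8, so treat the data as binary
    return prefix + base64.b64encode(data).decode('ascii')
```

Keeping binary data as Base64 means the output can still be detokenized later, once a token database is available.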

pw_tokenizer.proto.detokenize_fields(detokenizer: Detokenizer | None, proto: Message) → None#

Detokenizes fields annotated as tokenized in the given proto.

The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.