Detokenization#

pw_tokenizer: Compress strings to shrink logs by +75%

Detokenization is the process of expanding a token to the string it represents and decoding its arguments. pw_tokenizer provides Python, C++ and TypeScript detokenization libraries.

Example: decoding tokenized logs#

A project might tokenize its log messages with the Encoding Base64. Consider the following log file, which has four tokenized logs and one plain text log:

20200229 14:38:58 INF $HL2VHA==
20200229 14:39:00 DBG $5IhTKg==
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF $EgFj8lVVAUI=
20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=

The project’s log strings are stored in a database like the following:

1c95bd1c,          ,"Initiating retrieval process for recovery object"
2a5388e4,          ,"Determining optimal approach and coordinating vectors"
3743540c,          ,"Recovery object retrieval failed with status %s"
f2630112,          ,"Calculated acceptable probability of success (%.2f%%)"

Using the detokenizing tools with the database, the logs can be decoded:

20200229 14:38:58 INF Initiating retrieval process for recovery object
20200229 14:39:00 DBG Determining optimal algorithm and coordinating approach vectors
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY

Note

This example uses the Encoding Base64, which occupies about 4/3 (133%) as much space as the default binary format when encoded. For projects that wish to interleave tokenized with plain text, using Base64 is a worthwhile tradeoff.

Detokenization in Python#

To detokenize in Python, import Detokenizer from the pw_tokenizer package, and instantiate it with paths to token databases or ELF files.

import pw_tokenizer

detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')

def process_log_message(log_message):
    result = detokenizer.detokenize(log_message.payload)
    self._log(str(result))

The pw_tokenizer package also provides the AutoUpdatingDetokenizer class, which can be used in place of the standard Detokenizer. This class monitors database files for changes and automatically reloads them when they change. This is helpful for long-running tools that use detokenization. The class also supports token domains for the given database files in the <path>#<domain> format.

For messages that are optionally tokenized and may be encoded as binary, Base64, or plaintext UTF-8, use pw_tokenizer.proto.decode_optionally_tokenized(). This will attempt to determine the correct method to detokenize and always provide a printable string.

Decoding Base64#

The Python Detokenizer class supports decoding and detokenizing prefixed Base64 messages with detokenize_base64 and related methods.

Tip

The Python detokenization tools support recursive detokenization for prefixed Base64 text. Tokenized strings found in detokenized text are detokenized, so prefixed Base64 messages can be passed as %s arguments.

For example, the tokenized string for β€œWow!” is $RhYjmQ==. This could be passed as an argument to the printf-style string Nested message: %s, which encodes to $pEVTYQkkUmhZam1RPT0=. The detokenizer would decode the message as follows:

"$pEVTYQkkUmhZam1RPT0=" β†’ "Nested message: $RhYjmQ==" β†’ "Nested message: Wow!"

Base64 decoding is supported in C++ or C with the pw::tokenizer::PrefixedBase64Decode or pw_tokenizer_PrefixedBase64Decode functions.

Investigating undecoded Base64 messages#

Tokenized messages cannot be decoded if the token is not recognized. The Python package includes the parse_message tool, which parses tokenized Base64 messages without looking up the token in a database. This tool attempts to guess the types of the arguments and displays potential ways to decode them.

This tool can be used to extract argument information from an otherwise unusable message. It could help identify which statement in the code produced the message. This tool is not particularly helpful for tokenized messages without arguments, since all it can do is show the value of the unknown token.

The tool is executed by passing Base64 tokenized messages, with or without the $ prefix, to pw_tokenizer.parse_message. Pass -h or --help to see full usage information.

Example#
$ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d

INF Decoding arguments for '$329JMwA='
INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
INF Token:  0x33496fdf
INF Args:   b'\x00' [00] (1 bytes)
INF Decoding with up to 8 %s or %d arguments
INF   Attempt 1: [%s]
INF   Attempt 2: [%d] 0

INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
INF Token:  0xe7a58492
INF Args:   b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
INF Decoding with up to 8 %s or %d arguments
INF   Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
INF   Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK

Detokenizing protobufs#

The pw_tokenizer.proto Python module defines functions that may be used to detokenize protobuf objects in Python. The function pw_tokenizer.proto.detokenize_fields() detokenizes all fields annotated as tokenized, replacing them with their detokenized version. For example:

my_detokenizer = pw_tokenizer.Detokenizer(some_database)

my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

assert my_message.tokenized_field == b'The detokenized string! Cool!'

Decoding optionally tokenized strings#

The encoding used for an optionally tokenized field is not recorded in the protobuf. Despite this, the text can reliably be decoded. This is accomplished by attempting to decode the field as binary or Base64 tokenized data before treating it like plain text.

The following diagram describes the decoding process for optionally tokenized fields in detail.

flowchart TD start([Received bytes]) --> binary binary[Decode as<br>binary tokenized] --> binary_ok binary_ok{Detokenizes<br>successfully?} -->|no| utf8 binary_ok -->|yes| done_binary([Display decoded binary]) utf8[Decode as UTF-8] --> utf8_ok utf8_ok{Valid UTF-8?} -->|no| base64_encode utf8_ok -->|yes| base64 base64_encode[Encode as<br>tokenized Base64] --> display display([Display encoded Base64]) base64[Decode as<br>Base64 tokenized] --> base64_ok base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text base64_ok -->|yes| base64_results is_plain_text{Text is<br>printable?} -->|no| base64_encode is_plain_text-->|yes| plain_text base64_results([Display decoded Base64]) plain_text([Display text])

Potential decoding problems#

The decoding process for optionally tokenized fields will yield correct results in almost every situation. In rare circumstances, it is possible for it to fail, but these can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary#

If a plain-text string happens to decode as a binary tokenized message, the incorrect message could be displayed. This is very unlikely to occur. While many tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely that a device will happen to log one of these strings as plain text. The overwhelming majority of these strings will be nonsense.

If an implementation wishes to guard against this extremely improbable situation, it is possible to prevent it. This situation is prevented by appending 0xFF (or another byte never valid in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or all binary tokenized messages, if desired). When decoding, if there is an extra 0xFF byte, it is discarded.

Displaying undecoded binary as plain text instead of Base64#

If a message fails to decode as binary tokenized and it is not valid UTF-8, it is displayed as tokenized Base64. This makes it easily recognizable as a tokenized message and makes it simple to decode later from the text output (for example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters When decoding with an out-of-date token database, it is possible that some binary tokenized messages will be displayed as plain text rather than tokenized Base64.

This situation is likely to occur, but should be infrequent. Even if it does happen, it is not a serious issue. A very small number of strings will be displayed incorrectly, but these strings cannot be decoded anyway. One nonsense string (e.g. a-D1) would be displayed instead of another ($YS1EMQ==). Updating the token database would resolve the issue, though the non-Base64 logs would be difficult decode later from a log file.

This situation can be avoided with the same approach described in Accidentally interpreting plain text as tokenized binary. Appending an invalid UTF-8 character prevents the undecoded binary message from being interpreted as plain text.

Detokenization in C++#

The C++ detokenization libraries can be used in C++ or any language that can call into C++ with a C-linkage wrapper, such as Java or Rust. A reference Java Native Interface (JNI) implementation is provided.

The C++ detokenization library uses binary-format token databases (created with database.py create --type binary). Read a binary format database from a file or include it in the source code. Pass the database array to TokenDatabase::Create, and construct a detokenizer.

Detokenizer detokenizer(TokenDatabase::Create(token_database_array));

std::string ProcessLog(span<uint8_t> log_data) {
  return detokenizer.Detokenize(log_data).BestString();
}

The TokenDatabase class verifies that its data is valid before using it. If it is invalid, the TokenDatabase::Create returns an empty database for which ok() returns false. If the token database is included in the source code, this check can be done at compile time.

// This line fails to compile with a static_assert if the database is invalid.
constexpr TokenDatabase kDefaultDatabase =  TokenDatabase::Create<kData>();

Detokenizer OpenDatabase(std::string_view path) {
  std::vector<uint8_t> data = ReadWholeFile(path);

  TokenDatabase database = TokenDatabase::Create(data);

  // This checks if the file contained a valid database. It is safe to use a
  // TokenDatabase that failed to load (it will be empty), but it may be
  // desirable to provide a default database or otherwise handle the error.
  if (database.ok()) {
    return Detokenizer(database);
  }
  return Detokenizer(kDefaultDatabase);
}

Detokenization in TypeScript#

To detokenize in TypeScript, import Detokenizer from the pigweedjs package, and instantiate it with a CSV token database.

import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
const { Detokenizer } = pw_tokenizer;
const { Frame } = pw_hdlc;

const detokenizer = new Detokenizer(String(tokenCsv));

function processLog(frame: Frame){
  const result = detokenizer.detokenize(frame);
  console.log(result);
}

For messages that are encoded in Base64, use Detokenizer::detokenizeBase64. detokenizeBase64 will also attempt to detokenize nested Base64 tokens. There is also detokenizeUint8Array that works just like detokenize but expects Uint8Array instead of a Frame argument.

Detokenizing CLI tool#

pw_tokenizer provides two standalone command line utilities for detokenizing Base64-encoded tokenized strings.

  • detokenize.py – Detokenizes Base64-encoded strings in files or from stdin.

  • serial_detokenizer.py – Detokenizes Base64-encoded strings from a connected serial device.

If the pw_tokenizer Python package is installed, these tools may be executed as runnable modules. For example:

# Detokenize Base64-encoded strings in a file
python -m pw_tokenizer.detokenize -i input_file.txt

# Detokenize Base64-encoded strings in output from a serial device
python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0

See the --help options for these tools for full usage information.

Appendix#

Python detokenization: C99 printf compatibility notes#

This implementation is designed to align with the C99 specification, section 7.19.6. Notably, this specification is slightly different than what is implemented in most compilers due to each compiler choosing to interpret undefined behavior in slightly different ways. Treat the following description as the source of truth.

This implementation supports:

  • Overall Format: %[flags][width][.precision][length][specifier]

  • Flags (Zero or More)
    • -: Left-justify within the given field width; Right justification is the default (see Width modifier).

    • +: Forces to preceed the result with a plus or minus sign (+ or -) even for positive numbers. By default, only negative numbers are preceded with a - sign.

    • (space): If no sign is going to be written, a blank space is inserted before the value.

    • #: Specifies an alternative print syntax should be used.
      • Used with o, x or X specifiers the value is preceeded with 0, 0x or 0X, respectively, for values different than zero.

      • Used with a, A, e, E, f, F, g, or G it forces the written output to contain a decimal point even if no more digits follow. By default, if no digits follow, no decimal point is written.

    • 0: Left-pads the number with zeroes (0) instead of spaces when padding is specified (see width sub-specifier).

  • Width (Optional)
    • (number): Minimum number of characters to be printed. If the value to be printed is shorter than this number, the result is padded with blank spaces or 0 if the 0 flag is present. The value is not truncated even if the result is larger. If the value is negative and the 0 flag is present, the 0s are padded after the - symbol.

    • *: The width is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.

  • Precision (Optional)
    • .(number)
      • For d, i, o, u, x, X, specifies the minimum number of digits to be written. If the value to be written is shorter than this number, the result is padded with leading zeros. The value is not truncated even if the result is longer.

        • A precision of 0 means that no character is written for the value 0.

      • For a, A, e, E, f, and F, specifies the number of digits to be printed after the decimal point. By default, this is 6.

      • For g and G, specifies the maximum number of significant digits to be printed.

      • For s, specifies the maximum number of characters to be printed. By default all characters are printed until the ending null character is encountered.

      • If the period is specified without an explicit value for precision, 0 is assumed.

    • .*: The precision is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.

  • Length (Optional)
    • hh: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed char or unsigned char. However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).

    • h: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed short int or unsigned short int. However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).

    • l: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed long int or unsigned long int. Also is usable with c and s to specify that the arguments will be encoded with wchar_t values (which isn’t different from normal char values). However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).

    • ll: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed long long int or unsigned long long int. This is required to properly decode the argument as a 64-bit integer.

    • L: Usable with a, A, e, E, f, F, g, or G conversion specifiers applies to a long double argument. However, this is ignored in the implementation due to floating point value encoded that is unaffected by bit width.

    • j: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a intmax_t or uintmax_t.

    • z: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a size_t. This will force the argument to be decoded as an unsigned integer.

    • t: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a ptrdiff_t.

    • If a length modifier is provided for an incorrect specifier, it is ignored.

  • Specifier (Required)
    • d / i: Used for signed decimal integers.

    • u: Used for unsigned decimal integers.

    • o: Used for unsigned decimal integers and specifies formatting should be as an octal number.

    • x: Used for unsigned decimal integers and specifies formatting should be as a hexadecimal number using all lowercase letters.

    • X: Used for unsigned decimal integers and specifies formatting should be as a hexadecimal number using all uppercase letters.

    • f: Used for floating-point values and specifies to use lowercase, decimal floating point formatting.

      • Default precision is 6 decimal places unless explicitly specified.

    • F: Used for floating-point values and specifies to use uppercase, decimal floating point formatting.

      • Default precision is 6 decimal places unless explicitly specified.

    • e: Used for floating-point values and specifies to use lowercase, exponential (scientific) formatting.

      • Default precision is 6 decimal places unless explicitly specified.

    • E: Used for floating-point values and specifies to use uppercase, exponential (scientific) formatting.

      • Default precision is 6 decimal places unless explicitly specified.

    • g: Used for floating-point values and specified to use f or e formatting depending on which would be the shortest representation.

      • Precision specifies the number of significant digits, not just digits after the decimal place.

      • If the precision is specified as 0, it is interpreted to mean 1.

      • e formatting is used if the the exponent would be less than -4 or is greater than or equal to the precision.

      • Trailing zeros are removed unless the # flag is set.

      • A decimal point only appears if it is followed by a digit.

      • NaN or infinities always follow f formatting.

    • G: Used for floating-point values and specified to use f or e formatting depending on which would be the shortest representation.

      • Precision specifies the number of significant digits, not just digits after the decimal place.

      • If the precision is specified as 0, it is interpreted to mean 1.

      • E formatting is used if the the exponent would be less than -4 or is greater than or equal to the precision.

      • Trailing zeros are removed unless the # flag is set.

      • A decimal point only appears if it is followed by a digit.

      • NaN or infinities always follow F formatting.

    • c: Used for formatting a char value.

    • s: Used for formatting a string of char values.

      • If width is specified, the null terminator character is included as a character for width count.

      • If precision is specified, no more chars than that value will be written from the string (padding is used to fill additional width).

    • p: Used for formatting a pointer address.

    • %: Prints a single %. Only valid as %% (supports no flags, width, precision, or length modifiers).

Underspecified details:

  • If both + and (space) flags appear, the (space) is ignored.

  • The + and (space) flags will error if used with c or s.

  • The # flag will error if used with d, i, u, c, s, or p.

  • The 0 flag will error if used with c, s, or p.

  • Both + and (space) can work with the unsigned integer specifiers u, o, x, and X.

  • If a length modifier is provided for an incorrect specifier, it is ignored.

  • The z length modifier will decode arugments as signed as long as d or i is used.

  • p is implementation defined.

    • For this implementation, it will print with a 0x prefix and then the pointer value was printed using %08X.

    • p supports the +, -, and (space) flags, but not the # or 0 flags.

    • None of the length modifiers are usable with p.

    • This implementation will try to adhere to user-specified width (assuming the width provided is larger than the guaranteed minimum of 10).

    • Specifying precision for p is considered an error.

  • Only %% is allowed with no other modifiers. Things like %+% will fail to decode. Some C stdlib implementations support any modifiers being present between %, but ignore any for the output.

  • If a width is specified with the 0 flag for a negative value, the padded 0s will appear after the - symbol.

  • A precision of 0 for d, i, u, o, x, or X means that no character is written for the value 0.

  • Precision cannot be specified for c.

  • Using * or fixed precision with the s specifier still requires the string argument to be null-terminated. This is due to argument encoding happening on the C/C++-side while the precision value is not read or otherwise used until decoding happens in this Python code.

Non-conformant details:

  • n specifier: We do not support the n specifier since it is impossible for us to retroactively tell the original program how many characters have been printed since this decoding happens a great deal of time after the device sent it, usually on a separate processing device entirely.