Tokenized fields in protocol buffers

Text may be represented in a few different ways:

  • Plain ASCII or UTF-8 text (This is plain text)

  • Base64-encoded tokenized message ($ibafcA==)

  • Binary-encoded tokenized message (89 b6 9f 70)

  • Little-endian 32-bit integer token (0x709fb689)
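The last three representations in this list are the same data in different forms. The following minimal sketch, using only the Python standard library, reproduces the example values above. It assumes a tokenized message with no arguments, so the binary form is just the 4-byte token; the helper names are illustrative and are not part of pw_tokenizer.

```python
import base64
import struct

TOKEN = 0x709FB689  # Example token from the list above.


def token_to_binary(token: int) -> bytes:
    """Packs a 32-bit token as little-endian bytes (hypothetical helper)."""
    return struct.pack("<I", token)


def binary_to_prefixed_base64(data: bytes) -> str:
    """Base64-encodes binary tokenized data and prepends the '$' prefix."""
    return "$" + base64.b64encode(data).decode("ascii")


binary = token_to_binary(TOKEN)
text = binary_to_prefixed_base64(binary)

print(binary.hex(" "))  # 89 b6 9f 70
print(text)  # $ibafcA==
```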

pw_tokenizer provides tools for working with protobuf fields that may contain tokenized text.

Tokenized field protobuf option

pw_tokenizer provides the pw.tokenizer.format protobuf field option. This option may be applied to a protobuf field to indicate that it may contain a tokenized string. A string that is optionally tokenized is represented with a single bytes field annotated with (pw.tokenizer.format) = TOKENIZATION_OPTIONAL.

For example, the following protobuf has one field that may contain a tokenized string.

message MessageWithOptionallyTokenizedField {
  bytes just_bytes = 1;
  bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
  string just_text = 3;
}

Decoding optionally tokenized strings

The encoding used for an optionally tokenized field is not recorded in the protobuf. Despite this, the text can reliably be decoded. This is accomplished by attempting to decode the field as binary or Base64 tokenized data before treating it like plain text.

The following diagram describes the decoding process for optionally tokenized fields in detail.

flowchart TD
  start([Received bytes]) --> binary
  binary[Decode as<br>binary tokenized] --> binary_ok
  binary_ok{Detokenizes<br>successfully?} -->|no| utf8
  binary_ok -->|yes| done_binary([Display decoded binary])
  utf8[Decode as UTF-8] --> utf8_ok
  utf8_ok{Valid UTF-8?} -->|no| base64_encode
  utf8_ok -->|yes| base64
  base64_encode[Encode as<br>tokenized Base64] --> display
  display([Display encoded Base64])
  base64[Decode as<br>Base64 tokenized] --> base64_ok
  base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
  base64_ok -->|yes| base64_results
  is_plain_text{Text is<br>printable?} -->|no| base64_encode
  is_plain_text-->|yes| plain_text
  base64_results([Display decoded Base64])
  plain_text([Display text])
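The decision order in the diagram can be sketched in plain Python. This is an illustration of the flow only, not the pw_tokenizer implementation: the toy database, the toy_* helpers, and the function name decode_sketch are all hypothetical, and the use of str.isprintable() for the "Text is printable?" step is an assumption (it treats newlines and tabs as non-printable).

```python
import base64
import binascii
import re
from typing import Callable, Optional

# Hypothetical toy database: binary token -> string (argument-free messages).
TOY_DATABASE = {bytes.fromhex("89b69f70"): "Hello from a token"}

_BASE64_MESSAGE = re.compile(r"\$[A-Za-z0-9+/]+={0,2}")


def toy_detokenize_binary(data: bytes) -> Optional[str]:
    """Returns the string for a known argument-free token, else None."""
    return TOY_DATABASE.get(data)


def toy_detokenize_base64(text: str) -> str:
    """Replaces each known $Base64 message in text; leaves the rest as-is."""

    def replace(match):
        try:
            raw = base64.b64decode(match.group(0)[1:], validate=True)
        except binascii.Error:
            return match.group(0)  # Not valid Base64; keep the original text.
        return TOY_DATABASE.get(raw, match.group(0))

    return _BASE64_MESSAGE.sub(replace, text)


def encode_as_base64(data: bytes) -> str:
    """Encodes raw bytes as '$'-prefixed Base64 for display."""
    return "$" + base64.b64encode(data).decode("ascii")


def decode_sketch(
    data: bytes,
    detokenize_binary: Callable[[bytes], Optional[str]] = toy_detokenize_binary,
    detokenize_base64: Callable[[str], str] = toy_detokenize_base64,
) -> str:
    # Step 1: try binary tokenized data first.
    decoded = detokenize_binary(data)
    if decoded is not None:
        return decoded

    # Step 2: bytes that are not valid UTF-8 are shown as tokenized Base64.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return encode_as_base64(data)

    # Step 3: try Base64 tokenized text; any change counts as a
    # full or partial detokenization.
    detokenized = detokenize_base64(text)
    if detokenized != text:
        return detokenized

    # Step 4: printable text is shown as-is; anything else as Base64.
    return text if text.isprintable() else encode_as_base64(data)
```

With the toy database, decode_sketch(b"$ibafcA==") and decode_sketch(bytes.fromhex("89b69f70")) both produce the same string, while decode_sketch(b"This is plain text") returns the text unchanged.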

Potential decoding problems

The decoding process for optionally tokenized fields yields correct results in almost every situation. In rare circumstances it can fail, but these failures can be avoided with a low-overhead mitigation if desired.

There are two ways in which the decoding process may fail.

Accidentally interpreting plain text as tokenized binary

If a plain-text string happens to decode as a binary tokenized message, the incorrect message could be displayed. This is very unlikely to occur. While many tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely that a device will happen to log one of these strings as plain text. The overwhelming majority of these strings will be nonsense.

An implementation that wishes to guard against this extremely improbable situation can do so by appending 0xFF (or another byte that is never valid in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or to all binary tokenized messages, if desired). When decoding, the extra 0xFF byte is discarded if present.
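A minimal sketch of this mitigation follows, using the variant that appends the guard byte to all binary tokenized messages. With that variant the decoder can strip exactly one trailing 0xFF without having to work out whether the byte is "extra". The function names are hypothetical and are not part of pw_tokenizer.

```python
from typing import Tuple

GUARD_BYTE = b"\xff"  # 0xFF never appears in valid UTF-8.


def add_guard(binary_message: bytes) -> bytes:
    """Encoder side: marks binary tokenized data so it cannot be valid UTF-8."""
    return binary_message + GUARD_BYTE


def split_guard(field: bytes) -> Tuple[bytes, bool]:
    """Decoder side: returns (payload, was_guarded).

    A guarded field is binary tokenized data; an unguarded field is text
    (plain or Base64 tokenized), because valid UTF-8 cannot end in 0xFF.
    """
    if field.endswith(GUARD_BYTE):
        return field[:-1], True
    return field, False


def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The 4-byte token b"a-D1" is valid ASCII on its own, but not once guarded.
guarded = add_guard(b"a-D1")
```

The conditional variant (guarding only messages that happen to be valid UTF-8) saves a byte in most messages, but the decoder then needs to know the expected message length to tell a guard byte apart from a payload that legitimately ends in 0xFF.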

Displaying undecoded binary as plain text instead of Base64

If a message fails to decode as binary tokenized and it is not valid UTF-8, it is displayed as tokenized Base64. This makes it easily recognizable as a tokenized message and makes it simple to decode later from the text output (for example, with an updated token database).

A binary message for which the token is not known may coincidentally be valid UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters. When decoding with an out-of-date token database, it is possible that some binary tokenized messages will be displayed as plain text rather than tokenized Base64.
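The 6.25% figure follows from half of all byte values (0x00 to 0x7F) being ASCII. The short calculation below checks it. The printable-only figure is an added point of comparison rather than something stated above; it assumes the 95 printable ASCII characters, 0x20 to 0x7E.

```python
ALL_BYTE_VALUES = 256
ASCII_BYTE_VALUES = 128  # 0x00-0x7F, including control characters.
PRINTABLE_ASCII_VALUES = 95  # 0x20-0x7E.
TOKEN_SIZE_BYTES = 4

# Fraction of 4-byte sequences made up only of ASCII bytes.
ascii_fraction = (ASCII_BYTE_VALUES / ALL_BYTE_VALUES) ** TOKEN_SIZE_BYTES

# Fraction made up only of printable ASCII bytes (roughly 1.9%).
printable_fraction = (
    PRINTABLE_ASCII_VALUES / ALL_BYTE_VALUES
) ** TOKEN_SIZE_BYTES

print(f"{ascii_fraction:.2%}")  # 6.25%
```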

This situation is likely to occur at some point, but should be infrequent. Even if it does happen, it is not a serious issue. A very small number of strings will be displayed incorrectly, but these strings cannot be decoded anyway. One nonsense string (e.g. a-D1) would be displayed instead of another ($YS1EMQ==). Updating the token database would resolve the issue, though the non-Base64 logs would be difficult to decode later from a log file.

This situation can be avoided with the same approach described in Accidentally interpreting plain text as tokenized binary. Appending a byte that is never valid in UTF-8 prevents the undecoded binary message from being interpreted as plain text.

Python library

The pw_tokenizer.proto module defines functions that may be used to detokenize protobuf objects in Python. The function pw_tokenizer.proto.detokenize_fields() detokenizes all fields annotated as tokenized, replacing them with their detokenized version. For example:

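import pw_tokenizer.proto  # Also makes the pw_tokenizer package available.

# some_database and SomeMessage are placeholders for a token database and a
# generated protobuf message class.
my_detokenizer = pw_tokenizer.Detokenizer(some_database)

my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
# Replaces the annotated fields of my_message with their detokenized contents.
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)

assert my_message.tokenized_field == b'The detokenized string! Cool!'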

pw_tokenizer.proto

Utilities for working with tokenized fields in protobufs.

pw_tokenizer.proto.decode_optionally_tokenized(detokenizer: pw_tokenizer.detokenize.Detokenizer, data: bytes) -> str

Decodes data that may be plain text or binary / Base64 tokenized text.

pw_tokenizer.proto.detokenize_fields(detokenizer: pw_tokenizer.detokenize.Detokenizer, proto: google.protobuf.message.Message) -> None

Detokenizes fields annotated as tokenized in the given proto.

The fields are replaced with their detokenized version in the proto. Tokenized fields are bytes fields, so the detokenized string is stored as bytes. Call .decode() to convert the detokenized string from bytes to str.
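As a small illustration of that last step, the snippet below converts a detokenized value from bytes to str. The literal stands in for the contents of a field after detokenization; bytes.decode() uses UTF-8 by default.

```python
# Stand-in for the value of a bytes field after detokenize_fields() has run.
detokenized_field = b"The detokenized string! Cool!"

# bytes.decode() defaults to UTF-8 and returns a str.
text = detokenized_field.decode()
```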