Detokenization#
pw_tokenizer: Compress strings to shrink logs by +75%
Detokenization is the process of expanding a token to the string it represents and decoding its arguments. pw_tokenizer provides Python, C++, and TypeScript detokenization libraries.
Example: decoding tokenized logs#
A project might tokenize its log messages with the Base64 encoding. Consider the following log file, which has four tokenized logs and one plain text log:
20200229 14:38:58 INF $HL2VHA==
20200229 14:39:00 DBG $5IhTKg==
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF $EgFj8lVVAUI=
20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
The project’s log strings are stored in a database like the following:
1c95bd1c, ,"Initiating retrieval process for recovery object"
2a5388e4, ,"Determining optimal approach and coordinating vectors"
3743540c, ,"Recovery object retrieval failed with status %s"
f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
Using the detokenizing tools with the database, the logs can be decoded:
20200229 14:38:58 INF Initiating retrieval process for recovery object
20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
20200229 14:39:20 DBG Crunching numbers to calculate probability of success
20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
Note
This example uses the Base64 encoding, which occupies about 4/3 (133%) as much space as the default binary format. For projects that wish to interleave tokenized messages with plain text, using Base64 is a worthwhile tradeoff.
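The relationship between the Base64 text in the log and the tokens in the database can be checked by hand. The following is a minimal sketch that uses only the Python standard library rather than the pw_tokenizer package; split_message and DATABASE are illustrative names, and the sketch assumes the token is stored as the first four bytes of the decoded payload in little-endian order, which is consistent with the values shown above.

```python
import base64
import struct

# Token database from the example above (token -> format string).
DATABASE = {
    0x1C95BD1C: "Initiating retrieval process for recovery object",
    0x2A5388E4: "Determining optimal approach and coordinating vectors",
    0x3743540C: "Recovery object retrieval failed with status %s",
    0xF2630112: "Calculated acceptable probability of success (%.2f%%)",
}


def split_message(prefixed_base64):
    """Returns (token, argument bytes) for a $-prefixed Base64 message."""
    data = base64.b64decode(prefixed_base64[1:])  # Drop the $ prefix.
    (token,) = struct.unpack("<I", data[:4])  # Little-endian 32-bit token.
    return token, data[4:]


for message in ("$HL2VHA==", "$5IhTKg==", "$EgFj8lVVAUI=", "$DFRDNwlOT1RfUkVBRFk="):
    token, args = split_message(message)
    print(f"{token:08x} args={args!r} -> {DATABASE.get(token, '<unknown>')}")
```

The sketch only looks up the format string; it leaves the argument bytes undecoded, which is the part the real detokenizer handles.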
Detokenization in Python#
To detokenize in Python, import Detokenizer from the pw_tokenizer package, and instantiate it with paths to token databases or ELF files.
import pw_tokenizer
detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')
def process_log_message(log_message):
    # Look up the token and decode the arguments in the message payload.
    result = detokenizer.detokenize(log_message.payload)
    print(str(result))
The pw_tokenizer package also provides the AutoUpdatingDetokenizer class, which can be used in place of the standard Detokenizer. This class monitors database files for changes and automatically reloads them when they change. This is helpful for long-running tools that use detokenization. The class also supports filtering token domains for the given database files in the <path>#<domain> format.
For messages that are optionally tokenized and may be encoded as binary, Base64, or plaintext UTF-8, use pw_tokenizer.proto.decode_optionally_tokenized(). This will attempt to determine the correct method to detokenize and always provide a printable string.
Decoding Base64#
The Python Detokenizer class supports decoding and detokenizing prefixed Base64 messages with detokenize_base64 and related methods.
Tip
The Python detokenization tools support recursive detokenization for prefixed Base64 text. Tokenized strings found in detokenized text are detokenized, so prefixed Base64 messages can be passed as %s arguments.
For example, the tokenized string for “Wow!” is $RhYjmQ==. This could be passed as an argument to the printf-style string Nested message: %s, which encodes to $pEVTYQkkUmhZam1RPT0=. The detokenizer would decode the message as follows:
"$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"
Base64 decoding is supported in C++ or C with the pw::tokenizer::PrefixedBase64Decode or pw_tokenizer_PrefixedBase64Decode functions.
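The recursive decoding of the nested “Wow!” message can be illustrated with a small standard-library sketch. It is not the pw_tokenizer implementation: DATABASE, detokenize_recursively, and the simplified regular expression are assumptions made for illustration, the two token values are read off the Base64 strings in the example, and only %s arguments (one length byte followed by the string bytes) are handled.

```python
import base64
import re
import struct

# Hypothetical database holding the two strings from the example.
DATABASE = {
    0x615345A4: "Nested message: %s",
    0x99231646: "Wow!",
}

# A $ followed by Base64 characters and optional padding (simplified pattern).
_PREFIXED_BASE64 = re.compile(r"\$[A-Za-z0-9+/]+={0,2}")


def _decode_match(match):
    """Expands one prefixed Base64 message, or returns it unchanged."""
    original = match.group(0)
    try:
        data = base64.b64decode(original[1:], validate=True)
    except ValueError:
        return original
    if len(data) < 4:
        return original
    (token,) = struct.unpack("<I", data[:4])
    fmt = DATABASE.get(token)
    if fmt is None:
        return original
    args, rest = [], data[4:]
    for _ in range(fmt.count("%s")):
        if not rest:
            return original
        length = rest[0] & 0x7F  # Low 7 bits of the first byte hold the length.
        args.append(rest[1 : 1 + length].decode("utf-8", errors="replace"))
        rest = rest[1 + length :]
    return fmt % tuple(args)


def detokenize_recursively(text, max_passes=9):
    """Repeats the expansion until the text stops changing."""
    for _ in range(max_passes):
        expanded = _PREFIXED_BASE64.sub(_decode_match, text)
        if expanded == text:
            break
        text = expanded
    return text


print(detokenize_recursively("$pEVTYQkkUmhZam1RPT0="))
```

The pass limit keeps a message that expands to itself from looping forever; the value 9 is arbitrary.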
Investigating undecoded Base64 messages#
Tokenized messages cannot be decoded if the token is not recognized. The Python package includes the parse_message tool, which parses tokenized Base64 messages without looking up the token in a database. This tool attempts to guess the types of the arguments and displays potential ways to decode them.
This tool can be used to extract argument information from an otherwise unusable message. It could help identify which statement in the code produced the message. This tool is not particularly helpful for tokenized messages without arguments, since all it can do is show the value of the unknown token.
The tool is executed by passing Base64 tokenized messages, with or without the $ prefix, to pw_tokenizer.parse_message. Pass -h or --help to see full usage information.
Example#
$ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d
INF Decoding arguments for '$329JMwA='
INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
INF Token: 0x33496fdf
INF Args: b'\x00' [00] (1 bytes)
INF Decoding with up to 8 %s or %d arguments
INF Attempt 1: [%s]
INF Attempt 2: [%d] 0
INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
INF Token: 0xe7a58492
INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
INF Decoding with up to 8 %s or %d arguments
INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK
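The two attempts listed for the second message can be reproduced by hand. The sketch below is not the parse_message source; decode_varint, decode_string, and decode_args are illustrative helpers that assume integers are zigzag-encoded varints and strings are a length byte followed by the string bytes, which matches the values printed above.

```python
ARGS = b"n\x13FAILED_PRECONDITION\x02OK"  # Argument bytes from the output above.


def decode_varint(data):
    """Decodes one zigzag-encoded varint; returns (value, bytes consumed)."""
    raw = 0
    for count, byte in enumerate(data, start=1):
        raw |= (byte & 0x7F) << (7 * (count - 1))
        if not byte & 0x80:  # The high bit is clear on the final byte.
            return (raw >> 1) ^ -(raw & 1), count
    raise ValueError("incomplete varint")


def decode_string(data):
    """Decodes a length-prefixed string; returns (text, bytes consumed)."""
    length = data[0] & 0x7F
    return data[1 : 1 + length].decode("ascii"), 1 + length


def decode_args(specs, data):
    """Decodes the data as the given sequence of %d and %s arguments."""
    values = []
    for spec in specs:
        value, used = decode_varint(data) if spec == "%d" else decode_string(data)
        values.append(value)
        data = data[used:]
    return values


print(decode_args(["%d", "%s", "%d", "%d", "%d"], ARGS))
print(decode_args(["%d", "%s", "%s"], ARGS))
```

Both interpretations consume all 24 bytes, which is why the tool cannot choose between them without the format string.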
Detokenizing protobufs#
The pw_tokenizer.proto Python module defines functions that may be used to detokenize protobuf objects in Python. The function pw_tokenizer.proto.detokenize_fields() detokenizes all fields annotated as tokenized, replacing them with their detokenized version. For example:
my_detokenizer = pw_tokenizer.Detokenizer(some_database)
my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)
assert my_message.tokenized_field == b'The detokenized string! Cool!'
Decoding optionally tokenized strings#
The encoding used for an optionally tokenized field is not recorded in the protobuf. Despite this, the text can reliably be decoded. This is accomplished by attempting to decode the field as binary or Base64 tokenized data before treating it like plain text.
The following diagram describes the decoding process for optionally tokenized fields in detail.
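The order of the checks can be sketched in a few lines of standard-library Python. This is an illustration of the process described in this section, not the code behind pw_tokenizer.proto.decode_optionally_tokenized(); the one-entry DATABASE and the helper names are assumptions, and argument decoding is left out.

```python
import base64
import struct

# Hypothetical one-entry database; 0x31442D61 is b"a-D1" read as little-endian.
DATABASE = {0x31442D61: "The detokenized string! Cool!"}


def _lookup(data):
    """Returns the string for a binary tokenized message, or None if unknown."""
    if len(data) < 4:
        return None
    (token,) = struct.unpack("<I", data[:4])
    return DATABASE.get(token)  # Argument decoding is omitted in this sketch.


def decode_optionally_tokenized(data):
    # 1. Try the field as binary tokenized data.
    result = _lookup(data)
    if result is not None:
        return result
    # 2. If it is not valid UTF-8, show it as prefixed Base64 for later decoding.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "$" + base64.b64encode(data).decode("ascii")
    # 3. It is text: expand it if it is a known prefixed Base64 message.
    if text.startswith("$"):
        try:
            result = _lookup(base64.b64decode(text[1:], validate=True))
        except ValueError:
            result = None
        if result is not None:
            return result
    # 4. Otherwise treat it as plain text.
    return text


for field in (b"a-D1", b"$YS1EMQ==", b"plain text", b"\xff\xfe\x00\x01"):
    print(field, "->", decode_optionally_tokenized(field))
```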
Potential decoding problems#
The decoding process for optionally tokenized fields will yield correct results in almost every situation. In rare circumstances, it is possible for it to fail, but these can be avoided with a low-overhead mitigation if desired.
There are two ways in which the decoding process may fail.
Accidentally interpreting plain text as tokenized binary#
If a plain-text string happens to decode as a binary tokenized message, the incorrect message could be displayed. This is very unlikely to occur. While many tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely that a device will happen to log one of these strings as plain text. The overwhelming majority of these strings will be nonsense.
If an implementation wishes to guard against this extremely improbable situation, it is possible to prevent it. This situation is prevented by appending 0xFF (or another byte never valid in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or all binary tokenized messages, if desired). When decoding, if there is an extra 0xFF byte, it is discarded.
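On the encoding side, the mitigation amounts to a validity check and an optional extra byte. The following sketch shows one possible shape for it; is_valid_utf8 and guard_binary_message are hypothetical helper names, not part of pw_tokenizer.

```python
def is_valid_utf8(data):
    """Returns True if the bytes could be displayed as a UTF-8 string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guard_binary_message(message):
    """Appends 0xFF when a binary tokenized message could be mistaken for text."""
    return message + b"\xff" if is_valid_utf8(message) else message


print(guard_binary_message(b"a-D1"))  # Looks like text, so 0xFF is appended.
print(guard_binary_message(b"\xdfoI3\x00"))  # Already invalid UTF-8; unchanged.
```

Because 0xFF never appears in valid UTF-8, a guarded message can no longer pass the plain-text check on the decoding side.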
Displaying undecoded binary as plain text instead of Base64#
If a message fails to decode as binary tokenized and it is not valid UTF-8, it is displayed as tokenized Base64. This makes it easily recognizable as a tokenized message and makes it simple to decode later from the text output (for example, with an updated token database).
A binary message for which the token is not known may coincidentally be valid UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters. When decoding with an out-of-date token database, it is possible that some binary tokenized messages will be displayed as plain text rather than tokenized Base64.
This situation is likely to occur, but should be infrequent. Even if it does happen, it is not a serious issue. A very small number of strings will be displayed incorrectly, but these strings cannot be decoded anyway. One nonsense string (e.g. a-D1) would be displayed instead of another ($YS1EMQ==). Updating the token database would resolve the issue, though the non-Base64 logs would be difficult to decode later from a log file.
This situation can be avoided with the same approach described in Accidentally interpreting plain text as tokenized binary. Appending an invalid UTF-8 character prevents the undecoded binary message from being interpreted as plain text.
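The 6.25% figure and the a-D1 example in this section can be checked with a few lines of Python. This assumes the four bytes are independent and uniformly distributed, which is an idealization of real token values.

```python
import base64

# Each byte is ASCII (0x00-0x7F) with probability 128/256, so for four bytes:
ascii_fraction = (128 / 256) ** 4
print(f"{ascii_fraction:.2%}")

# The same four unknown bytes, shown as plain text and as prefixed Base64.
unknown_message = b"a-D1"
as_text = unknown_message.decode("ascii")
as_base64 = "$" + base64.b64encode(unknown_message).decode("ascii")
print(as_text, as_base64)
```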
Detokenization in C++#
The C++ detokenization libraries can be used in C++ or any language that can call into C++ with a C-linkage wrapper, such as Java or Rust. A reference Java Native Interface (JNI) implementation is provided.
The C++ detokenization library uses binary-format token databases (created with database.py create --type binary). Read a binary format database from a file or include it in the source code. Pass the database array to TokenDatabase::Create, and construct a detokenizer.
Detokenizer detokenizer(TokenDatabase::Create(token_database_array));
std::string ProcessLog(span<uint8_t> log_data) {
  return detokenizer.Detokenize(log_data).BestString();
}
The TokenDatabase class verifies that its data is valid before using it. If it is invalid, TokenDatabase::Create returns an empty database for which ok() returns false. If the token database is included in the source code, this check can be done at compile time.
// This line fails to compile with a static_assert if the database is invalid.
constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();
Detokenizer OpenDatabase(std::string_view path) {
  std::vector<uint8_t> data = ReadWholeFile(path);
  TokenDatabase database = TokenDatabase::Create(data);
  // This checks if the file contained a valid database. It is safe to use a
  // TokenDatabase that failed to load (it will be empty), but it may be
  // desirable to provide a default database or otherwise handle the error.
  if (database.ok()) {
    return Detokenizer(database);
  }
  return Detokenizer(kDefaultDatabase);
}
Detokenization in TypeScript#
To detokenize in TypeScript, import Detokenizer from the pigweedjs package, and instantiate it with a CSV token database.
import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
const { Detokenizer } = pw_tokenizer;
const { Frame } = pw_hdlc;
const detokenizer = new Detokenizer(String(tokenCsv));
function processLog(frame: Frame) {
  const result = detokenizer.detokenize(frame);
  console.log(result);
}
For messages that are encoded in Base64, use Detokenizer::detokenizeBase64. detokenizeBase64 will also attempt to detokenize nested Base64 tokens. There is also detokenizeUint8Array, which works just like detokenize but expects a Uint8Array instead of a Frame argument.
Detokenizing CLI tool#
pw_tokenizer provides two standalone command line utilities for detokenizing Base64-encoded tokenized strings.
- detokenize.py – Detokenizes Base64-encoded strings in files or from stdin.
- serial_detokenizer.py – Detokenizes Base64-encoded strings from a connected serial device.
If the pw_tokenizer Python package is installed, these tools may be executed as runnable modules. For example:
# Detokenize Base64-encoded strings in a file
python -m pw_tokenizer.detokenize -i input_file.txt
# Detokenize Base64-encoded strings in output from a serial device
python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0
See the --help options for these tools for full usage information.
Appendix#
Python detokenization: C99 printf compatibility notes#
This implementation is designed to align with the C99 specification, section 7.19.6. Notably, this specification is slightly different than what is implemented in most compilers due to each compiler choosing to interpret undefined behavior in slightly different ways. Treat the following description as the source of truth.
This implementation supports:
Overall Format:
%[flags][width][.precision][length][specifier]
- Flags (Zero or More)
  - -: Left-justify within the given field width; right justification is the default (see Width modifier).
  - +: Forces the result to be preceded with a plus or minus sign (+ or -) even for positive numbers. By default, only negative numbers are preceded with a - sign.
  - (space): If no sign is going to be written, a blank space is inserted before the value.
  - #: Specifies an alternative print syntax should be used.
    - Used with o, x, or X specifiers, the value is preceded with 0, 0x, or 0X, respectively, for values different than zero.
    - Used with a, A, e, E, f, F, g, or G, it forces the written output to contain a decimal point even if no more digits follow. By default, if no digits follow, no decimal point is written.
  - 0: Left-pads the number with zeroes (0) instead of spaces when padding is specified (see Width sub-specifier).
- Width (Optional)
  - (number): Minimum number of characters to be printed. If the value to be printed is shorter than this number, the result is padded with blank spaces, or with 0 if the 0 flag is present. The value is not truncated even if the result is larger. If the value is negative and the 0 flag is present, the 0s are padded after the - symbol.
  - *: The width is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.
- Precision (Optional)
  - .(number)
    - For d, i, o, u, x, X, specifies the minimum number of digits to be written. If the value to be written is shorter than this number, the result is padded with leading zeros. The value is not truncated even if the result is longer.
      - A precision of 0 means that no character is written for the value 0.
    - For a, A, e, E, f, and F, specifies the number of digits to be printed after the decimal point. By default, this is 6.
    - For g and G, specifies the maximum number of significant digits to be printed.
    - For s, specifies the maximum number of characters to be printed. By default all characters are printed until the ending null character is encountered.
    - If the period is specified without an explicit value for precision, 0 is assumed.
  - .*: The precision is not specified in the format string, but as an additional integer value argument preceding the argument that has to be formatted.
- Length (Optional)
  - hh: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed char or unsigned char. However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).
  - h: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed short int or unsigned short int. However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).
  - l: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed long int or unsigned long int. Also usable with c and s to specify that the arguments will be encoded with wchar_t values (which aren’t different from normal char values). However, this is largely ignored in the implementation due to it not being necessary for Python or argument decoding (since the argument is always encoded at least as a 32-bit integer).
  - ll: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a signed long long int or unsigned long long int. This is required to properly decode the argument as a 64-bit integer.
  - L: Usable with a, A, e, E, f, F, g, or G conversion specifiers to convey the argument will be a long double. However, this is ignored in the implementation because the encoded floating point value is unaffected by bit width.
  - j: Usable with d, i, o, u, x, or X specifiers to convey the argument will be an intmax_t or uintmax_t.
  - z: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a size_t. This will force the argument to be decoded as an unsigned integer.
  - t: Usable with d, i, o, u, x, or X specifiers to convey the argument will be a ptrdiff_t.
  - If a length modifier is provided for an incorrect specifier, it is ignored.
- Specifier (Required)
  - d / i: Used for signed decimal integers.
  - u: Used for unsigned decimal integers.
  - o: Used for unsigned integers and specifies formatting should be as an octal number.
  - x: Used for unsigned integers and specifies formatting should be as a hexadecimal number using all lowercase letters.
  - X: Used for unsigned integers and specifies formatting should be as a hexadecimal number using all uppercase letters.
  - f: Used for floating-point values and specifies to use lowercase, decimal floating point formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - F: Used for floating-point values and specifies to use uppercase, decimal floating point formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - e: Used for floating-point values and specifies to use lowercase, exponential (scientific) formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - E: Used for floating-point values and specifies to use uppercase, exponential (scientific) formatting.
    - Default precision is 6 decimal places unless explicitly specified.
  - g: Used for floating-point values and specifies to use f or e formatting depending on which would be the shortest representation.
    - Precision specifies the number of significant digits, not just digits after the decimal place.
    - If the precision is specified as 0, it is interpreted to mean 1.
    - e formatting is used if the exponent would be less than -4 or is greater than or equal to the precision.
    - Trailing zeros are removed unless the # flag is set.
    - A decimal point only appears if it is followed by a digit.
    - NaN or infinities always follow f formatting.
  - G: Used for floating-point values and specifies to use F or E formatting depending on which would be the shortest representation.
    - Precision specifies the number of significant digits, not just digits after the decimal place.
    - If the precision is specified as 0, it is interpreted to mean 1.
    - E formatting is used if the exponent would be less than -4 or is greater than or equal to the precision.
    - Trailing zeros are removed unless the # flag is set.
    - A decimal point only appears if it is followed by a digit.
    - NaN or infinities always follow F formatting.
  - c: Used for formatting a char value.
  - s: Used for formatting a string of char values.
    - If width is specified, the null terminator character is included as a character for width count.
    - If precision is specified, no more chars than that value will be written from the string (padding is used to fill additional width).
  - p: Used for formatting a pointer address.
  - %: Prints a single %. Only valid as %% (supports no flags, width, precision, or length modifiers).
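Several of the rules listed above can be tried out with Python's built-in % string operator. This is only an illustration: the built-in operator is not the pw_tokenizer decoder and does not match C99 in every case (for example, it formats %#o and a zero value with %.0d differently), so the table below is limited to cases where the two are expected to agree.

```python
# (format string, argument(s), expected result) for cases where Python's
# built-in % operator follows the C99 rules described above.
EXAMPLES = [
    ("%-6d|", 42, "42    |"),  # - flag: left-justify within the width.
    ("%+d", 42, "+42"),  # + flag: always show the sign.
    ("% d", 42, " 42"),  # (space) flag: blank where the sign would go.
    ("%#x", 255, "0xff"),  # # flag: alternative form for hexadecimal.
    ("%#.0f", 3.0, "3."),  # # flag: keep the decimal point.
    ("%05d", -42, "-0042"),  # 0 flag: zeros are padded after the - symbol.
    ("%*d", (6, 42), "    42"),  # *: width taken from the argument list.
    ("%.3d", 7, "007"),  # Precision: minimum digits for integers.
    ("%.3s", "abcdef", "abc"),  # Precision: maximum characters for strings.
    ("%.2f%%", 32.3333, "32.33%"),  # %% prints a single %.
    ("%g", 0.00001, "1e-05"),  # g: exponent below -4 selects e formatting.
    ("%g", 1000000.0, "1e+06"),  # g: exponent >= precision selects e formatting.
    ("%.0g", 123.0, "1e+02"),  # g: a precision of 0 is treated as 1.
]

for fmt, value, expected in EXAMPLES:
    result = fmt % value
    status = "ok" if result == expected else "DIFFERENT"
    print(f"{fmt!r:10} -> {result!r:12} {status}")
```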
Underspecified details:
- If both + and (space) flags appear, the (space) is ignored.
- The + and (space) flags will error if used with c or s.
- The # flag will error if used with d, i, u, c, s, or p.
- The 0 flag will error if used with c, s, or p.
- Both + and (space) can work with the unsigned integer specifiers u, o, x, and X.
- If a length modifier is provided for an incorrect specifier, it is ignored.
- The z length modifier will decode arguments as signed as long as d or i is used.
- p is implementation defined.
  - For this implementation, it will print a 0x prefix followed by the pointer value printed using %08X.
  - p supports the +, -, and (space) flags, but not the # or 0 flags.
  - None of the length modifiers are usable with p.
  - This implementation will try to adhere to user-specified width (assuming the width provided is larger than the guaranteed minimum of 10).
  - Specifying precision for p is considered an error.
- Only %% is allowed with no other modifiers. Things like %+% will fail to decode. Some C stdlib implementations support any modifiers being present between the % characters, but ignore any for the output.
- If a width is specified with the 0 flag for a negative value, the padded 0s will appear after the - symbol.
- A precision of 0 for d, i, u, o, x, or X means that no character is written for the value 0.
- Precision cannot be specified for c.
- Using * or fixed precision with the s specifier still requires the string argument to be null-terminated. This is due to argument encoding happening on the C/C++-side while the precision value is not read or otherwise used until decoding happens in this Python code.
Non-conformant details:
- n specifier: We do not support the n specifier since it is impossible for us to retroactively tell the original program how many characters have been printed since this decoding happens a great deal of time after the device sent it, usually on a separate processing device entirely.