Token databases#
pw_tokenizer: Compress strings to shrink logs by +75%
Token databases store a mapping of tokens to the strings they represent. An ELF file can be used as a token database, but it only contains the strings for its exact build. A token database file aggregates tokens from multiple ELF files, so that a single database can decode tokenized strings from any known ELF.
Token databases contain the token, removal date (if any), and string for each tokenized string.
Token database formats#
Three token database formats are supported: CSV, binary, and directory. Tokens may also be read from ELF files or .a archives, but cannot be written to these formats.
CSV database format#
The CSV database format has four columns: the token in hexadecimal, the removal date (if any) in year-month-day format, the token domain, and the string literal. The domain and string are quoted, and quote characters within the domain or string are represented as two quote characters.
This example database contains seven strings, three of which have removal dates.
141c35d5, ,"","The answer: ""%s"""
2e668cd6,2019-12-25,"","Jello, world!"
7a22c974, ,"metrics","%f"
7b940e2a, ,"","Hello %s! %hd %e"
851beeb6, ,"","%u %d"
881436a0,2020-01-01,"","The answer is: %s"
e13b0f94,2020-04-01,"metrics","%llu"
Legacy CSV databases did not include the domain, so they only had three columns. These databases are still supported, but their tokens are always in the default domain ("").
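The column layout above can be read with Python's standard `csv` module, which already treats a doubled quote inside a quoted field as a literal quote. The following is a minimal sketch, not the `pw_tokenizer` implementation: the `Entry` type and `parse_csv_database` function are hypothetical names introduced for illustration, and the handling of the blank date column (a single space in the example) is an assumption based on the sample rows.

```python
import csv
import io
from datetime import date
from typing import NamedTuple, Optional


class Entry(NamedTuple):  # Hypothetical record type for this sketch.
    token: int
    date_removed: Optional[date]
    domain: str
    string: str


def parse_csv_database(text: str) -> list[Entry]:
    """Parse CSV database text with four columns, or three for legacy files."""
    entries = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # Skip blank lines.
        if len(row) == 3:
            # Legacy layout: no domain column, so use the default domain "".
            token, removed, string = row
            domain = ""
        else:
            token, removed, domain, string = row
        removed = removed.strip()  # The date column is blank when not removed.
        entries.append(
            Entry(
                token=int(token, 16),
                date_removed=date.fromisoformat(removed) if removed else None,
                domain=domain,
                string=string,
            )
        )
    return entries


EXAMPLE = '''141c35d5, ,"","The answer: ""%s"""
2e668cd6,2019-12-25,"","Jello, world!"
7a22c974, ,"metrics","%f"
'''

for entry in parse_csv_database(EXAMPLE):
    print(f"{entry.token:08x} {entry.date_removed} {entry.domain!r} {entry.string!r}")
```

Keeping the removal date as `None` when the column is blank makes it easy to tell live strings from removed ones later on.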
Binary database format#
The binary database format consists of a 16-byte header followed by a series of 8-byte entries. Each entry stores the token and the removal date, which is 0xFFFFFFFF if there is none. The string literals are stored after the entries, in the same order as the entries. Strings are stored with null terminators. See token_database.h for full details.
The binary form of the CSV database is shown below. It contains the same information, but in a more compact and easily processed form. It takes 141 B compared with the CSV database’s 211 B.
[header]
0x00: 454b4f54 0000534e TOKENS..
0x08: 00000006 00000000 ........
[entries]
0x10: 141c35d5 ffffffff .5......
0x18: 2e668cd6 07e30c19 ..f.....
0x20: 7b940e2a ffffffff *..{....
0x28: 851beeb6 ffffffff ........
0x30: 881436a0 07e40101 .6......
0x38: e13b0f94 07e40401 ..;.....
[string table]
0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22 The answer: "%s"
0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48 .Jello, world!.H
0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00 ello %s! %hd %e.
0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72 %u %d.The answer
0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00 is: %s.%llu.
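To make the layout in the dump concrete, here is a minimal Python sketch that reads it back. It is based only on the dump above, not on token_database.h, so several details are assumptions: that all fields are little-endian, that the two bytes after the `TOKENS` magic are a version field, that the last four header bytes are reserved, and that a date such as `07e30c19` packs year, month, and day (2019-12-25). The function name `parse_binary_database` is hypothetical.

```python
import struct
from datetime import date

HEADER = struct.Struct("<6sHII")  # magic, version, entry count, reserved (assumed)
ENTRY = struct.Struct("<II")      # token, removal date
NO_DATE = 0xFFFFFFFF              # Marks an entry that has no removal date.


def parse_binary_database(data: bytes):
    """Return (token, removal date or None, string) tuples from a binary database."""
    magic, _version, count, _reserved = HEADER.unpack_from(data, 0)
    if magic != b"TOKENS":
        raise ValueError("missing TOKENS magic")

    # Strings follow the entries, null-terminated and in entry order.
    strings_start = HEADER.size + count * ENTRY.size
    strings = data[strings_start:].split(b"\0")

    entries = []
    for i in range(count):
        token, raw_date = ENTRY.unpack_from(data, HEADER.size + i * ENTRY.size)
        if raw_date == NO_DATE:
            removed = None
        else:
            # Assumed packing: year in the top 16 bits, then month, then day.
            removed = date(raw_date >> 16, (raw_date >> 8) & 0xFF, raw_date & 0xFF)
        entries.append((token, removed, strings[i].decode("utf-8")))
    return entries


# Build a two-entry database that mirrors the first two entries of the dump.
example = (
    HEADER.pack(b"TOKENS", 0, 2, 0)
    + ENTRY.pack(0x141C35D5, NO_DATE)
    + ENTRY.pack(0x2E668CD6, 0x07E30C19)
    + b'The answer: "%s"\0Jello, world!\0'
)
for token, removed, string in parse_binary_database(example):
    print(f"{token:08x} {removed} {string}")
```

Because the entries are fixed-size and the strings are stored in the same order, a reader only needs one pass over each region, which is why this format is simpler to process than CSV.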
Directory database format#
pw_tokenizer can consume directories of CSV databases. A directory database will be searched recursively for files with a .pw_tokenizer.csv suffix, all of which will be used for subsequent detokenization lookups.
An example directory database might look something like this:
directory_token_database
├── database.pw_tokenizer.csv
├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
└── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv
This format is optimized for storage in a Git repository alongside source code. The token database commands randomly generate unique file names for the CSVs in the database to prevent merge conflicts. Running the mark_removed or purge commands in the database CLI consolidates the files into a single CSV.
The database command line tool supports a --discard-temporary <upstream_commit> option for add. In this mode, the tool attempts to discard temporary tokens. It identifies the latest CSV not present in the provided <upstream_commit>, and tokens present in that CSV that are not in the newly added tokens are discarded. This helps keep temporary tokens (e.g., from debug logs) out of the database.
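The recursive search for `.pw_tokenizer.csv` files described in this section can be sketched in a few lines of Python. This is an illustration of the lookup rule only, not the tool's own code; `find_database_files` is a hypothetical helper, and the temporary directory stands in for a real directory database.

```python
import tempfile
from pathlib import Path

SUFFIX = ".pw_tokenizer.csv"


def find_database_files(root: Path) -> list[Path]:
    """Recursively collect the CSV files that make up a directory database."""
    return sorted(p for p in root.rglob(f"*{SUFFIX}") if p.is_file())


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in (
        "database.pw_tokenizer.csv",
        "nested/abc.pw_tokenizer.csv",
        "notes.csv",  # Ignored: does not have the database suffix.
    ):
        (root / name).write_text("")
    found = [p.relative_to(root).as_posix() for p in find_database_files(root)]

print(found)
```

Matching on the full suffix rather than on `.csv` keeps unrelated CSV files in the same tree out of the database.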
JSON support#
While pw_tokenizer doesn’t specify a JSON database format, a token database can be created from a JSON formatted array of strings. This is useful for side-band token database generation for strings that are not embedded as parsable tokens in compiled binaries. See Create a database for instructions on generating a token database from a JSON file.
Managing token databases#
Token databases are managed with the database.py script. This script can be used to extract tokens from compilation artifacts and manage database files. Invoke database.py with -h for full usage information.
An example ELF file with tokenized logs is provided at pw_tokenizer/py/example_binary_with_tokenized_strings.elf. You can use that file to experiment with the database.py commands.
Create a database#
The create command makes a new token database from ELF files (.elf, .o, .so, etc.), archives (.a), existing token databases (CSV or binary), or a JSON file containing an array of strings.
$ ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...
Two database output formats are supported: CSV and binary. Pass --type binary to the create command to generate a binary database instead of the default CSV. CSV databases are well suited to checking into source control and to human review. Binary databases are more compact and simpler to parse. The C++ detokenizer library currently supports only binary databases.
Update a database#
As new tokenized strings are added, update the database with the add command.
$ ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...
This command adds new tokens from ELF files or other databases to the database. Adding tokens already present in the database updates the date removed, if any, to the latest.
A CSV token database can be checked into a source repository and updated as code changes are made. The build system can invoke database.py to update the database after each build.
GN integration#
Token databases may be updated or created as part of a GN build. The pw_tokenizer_database template provided by $dir_pw_tokenizer/database.gni automatically updates an in-source tokenized strings database or creates a new database with artifacts from one or more GN targets or other database files.
To create a new database, set the create variable to the desired database type ("csv" or "binary"). The database will be created in the output directory. To update an existing database, provide the path to the database with the database variable.
import("//build_overrides/pigweed.gni")
import("$dir_pw_tokenizer/database.gni")
pw_tokenizer_database("my_database") {
  database = "database_in_the_source_tree.csv"
  targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
  input_databases = [ "other_database.csv" ]
}
Instead of specifying GN targets, paths or globs to output files may be provided with the paths option.
pw_tokenizer_database("my_database") {
  database = "database_in_the_source_tree.csv"
  deps = [ ":apps" ]
  optional_paths = [ "$root_build_dir/**/*.elf" ]
}
Note
The paths and optional_targets arguments do not add anything to deps, so there is no guarantee that the referenced artifacts will exist when the database is updated. Provide targets or deps, or build other GN targets first, if this is a concern.
CMake integration#
Token databases may be updated or created as part of a CMake build. The pw_tokenizer_database function provided by $dir_pw_tokenizer/database.cmake automatically updates an in-source tokenized strings database or creates a new database with artifacts from a CMake target.
To create a new database, set the CREATE argument to the desired database type ("csv" or "binary"). The database will be created in the output directory.
include("$dir_pw_tokenizer/database.cmake")
pw_tokenizer_database("my_database"
  CREATE binary
  TARGET my_target.ext
  DEPS ${deps_list}
)
To update an existing database, provide the path to the database with the DATABASE argument.
pw_tokenizer_database("my_database"
  DATABASE database_in_the_source_tree.csv
  TARGET my_target.ext
  DEPS ${deps_list}
)
Token collisions#
Tokens are calculated with a hash function. It is possible for different strings to hash to the same token. When this happens, multiple strings will have the same token in the database, and it may not be possible to unambiguously decode a token.
The detokenization tools attempt to resolve collisions automatically. Collisions are resolved based on two things:
whether the tokenized data matches the string's arguments (if any), and
if / when the string was marked as having been removed from the database.
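The two criteria above can be pictured as a ranking over the candidate strings that share a token. The sketch below is illustrative only and is not the detokenizer's actual algorithm: the `Candidate` type, the `rank` function, and in particular the order of priority (successful argument decoding first, then removal status) are assumptions made for this example.

```python
from datetime import date
from typing import NamedTuple, Optional


class Candidate(NamedTuple):  # Hypothetical type for this sketch.
    string: str
    args_decoded_ok: bool         # Did the payload match this format string?
    date_removed: Optional[date]  # None means the string is still present.


def rank(candidate: Candidate):
    # Assumed priority: a clean argument decode wins first; after that,
    # strings never removed beat removed ones, and later removals beat earlier.
    return (candidate.args_decoded_ok, candidate.date_removed or date.max)


def best_match(candidates):
    """Pick the most plausible string among those sharing a token."""
    return max(candidates, key=rank)


candidates = [
    Candidate("Old message: %s", True, date(2020, 1, 1)),
    Candidate("New message: %s", True, None),
    Candidate("Value: %d %d", False, None),
]
print(best_match(candidates).string)
```

Treating a missing removal date as `date.max` lets a single tuple comparison express both "still present" and "removed most recently".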
Resolving collisions#
Collisions may occur occasionally. Run the command python -m pw_tokenizer.database report <database> to see information about a token database, including any collisions.
If there are collisions, take the following steps to resolve them.
Change one of the colliding strings slightly to give it a new token.
In C (not C++), artificial collisions may occur if strings longer than PW_TOKENIZER_CFG_C_HASH_LENGTH are hashed. If this is happening, consider setting PW_TOKENIZER_CFG_C_HASH_LENGTH to a larger value. See pw_tokenizer/public/pw_tokenizer/config.h.
Run the mark_removed command with the latest version of the build artifacts to mark missing strings as removed. This deprioritizes them in collision resolution.
$ python -m pw_tokenizer.database mark_removed --database <database> <ELF files>
The purge command may be used to delete these tokens from the database.
Probability of collisions#
Hashes of any size have a collision risk. The probability of at least one collision occurring for a given number of strings is unintuitively high (this is known as the birthday problem). If fewer than 32 bits are used for tokens, the probability of collisions increases substantially.
This table shows the approximate number of strings that can be hashed to have a 1% or 50% probability of at least one collision (assuming a uniform, random hash).
Token bits | Strings for 50% collision probability | Strings for 1% collision probability |
---|---|---|
32 | 77000 | 9300 |
31 | 54000 | 6600 |
24 | 4800 | 580 |
16 | 300 | 36 |
8 | 19 | 3 |
Keep this table in mind when masking tokens (see Reduce token size with masking). 16 bits might be acceptable when tokenizing a small set of strings, such as module names, but won’t be suitable for large sets of strings, like log messages.
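The figures in the table can be approximated with the standard birthday-problem estimate: for a uniform hash with b bits, the number of strings n at which the chance of at least one collision reaches p is roughly sqrt(2 · 2^b · ln(1 / (1 − p))). The sketch below applies that formula; the function name is hypothetical, and because the formula is an approximation, its results are close to the rounded table values but not identical to them.

```python
import math


def strings_for_collision_probability(token_bits: int, probability: float) -> float:
    """Approximate string count at which P(at least one collision) = probability."""
    hash_space = 2**token_bits
    return math.sqrt(2 * hash_space * math.log(1 / (1 - probability)))


for bits in (32, 31, 24, 16, 8):
    n50 = strings_for_collision_probability(bits, 0.50)
    n01 = strings_for_collision_probability(bits, 0.01)
    print(f"{bits:>2} bits: 50% at ~{n50:,.0f} strings, 1% at ~{n01:,.0f} strings")
```

The square root is what makes the risk unintuitive: removing one bit from the token shrinks the safe string count by a factor of about 1.4, and halving the token width takes the square root of it.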