pw_tokenizer

Compress strings to shrink logs by 75% or more

Status: Stable. Languages: C, C++, Python, Rust, TypeScript, Java. Code Size Impact: 50% reduction in log size.

Logging is critical, but developers are often forced to choose between adding more logging and saving crucial flash space. The pw_tokenizer module enables extensive logging with substantially less memory usage by replacing printf-style strings with binary tokens during compilation. It is designed to integrate easily into existing logging systems.

Although the most common application of pw_tokenizer is binary logging, the tokenizer is general purpose and can be used to tokenize any strings, with or without printf-style arguments.
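As a rough illustration of the general idea, the sketch below replaces strings with fixed-size 4-byte tokens and keeps the token-to-string mapping in a separate table. It is not pw_tokenizer's implementation: the `toy_token` helper and the `database` dictionary are hypothetical names, and CRC-32 is only a stand-in for pw_tokenizer's own hash function.

```python
import zlib


def toy_token(string: str) -> bytes:
    """Returns a 4-byte stand-in token for a string.

    Assumption: CRC-32 is used purely for illustration. It is NOT the
    hash that pw_tokenizer uses, so these tokens will not match real ones.
    """
    return zlib.crc32(string.encode("utf-8")).to_bytes(4, "little")


# Strings with and without printf-style arguments are tokenized the same way.
strings = ["Boot complete", "Battery Voltage: %d mV"]

# Hypothetical off-device table mapping each token back to its string.
database = {toy_token(s): s for s in strings}

# On the device, only the 4-byte token needs to be stored or sent.
token = toy_token("Boot complete")
print(len(token), "bytes instead of", len("Boot complete"))

# Off the device, the table recovers the original string.
print(database[token])
```

The point of the sketch is the split of responsibilities: the device handles only small fixed-size tokens, while the strings themselves live in a table that never has to be stored on the device.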

Why tokenize strings?

  • Dramatically reduce binary size by removing string literals from binaries.

  • Reduce I/O traffic, RAM, and flash usage by sending and storing compact tokens instead of strings. We’ve seen over 50% reduction in encoded log contents.

  • Reduce CPU usage by replacing snprintf calls with simple tokenization code.

  • Remove potentially sensitive log, assert, and other strings from binaries.

  • Get started: Integrate pw_tokenizer into your project.

  • Tokenization: Convert strings and arguments to tokens.

  • Token databases: Store a mapping of tokens to the strings and arguments they represent.

  • Detokenization: Expand tokens back to the strings and arguments they represent.

  • API reference: Detailed reference information about the pw_tokenizer API.

Tokenized logging in action

Here’s an example of how pw_tokenizer enables you to store and send the same logging information using significantly fewer resources:

        flowchart TD

  subgraph after["After: Tokenized Logs (37 bytes saved!)"]
    after_log["LOG(#quot;Battery Voltage: %d mV#quot;, voltage)"] -- 4 bytes stored on-device as... -->
    after_encoding["d9 28 47 8e"] -- 6 bytes sent over the wire as... -->
    after_transmission["d9 28 47 8e aa 3e"] -- Displayed in logs as... -->
    after_display["#quot;Battery Voltage: 3989 mV#quot;"]
  end

  subgraph before["Before: No Tokenization"]
    before_log["LOG(#quot;Battery Voltage: %d mV#quot;, voltage)"] -- 41 bytes stored on-device as... -->
    before_encoding["#quot;Battery Voltage: %d mV#quot;"] -- 43 bytes sent over the wire as... -->
    before_transmission["#quot;Battery Voltage: 3989 mV#quot;"] -- Displayed in logs as... -->
    before_display["#quot;Battery Voltage: 3989 mV#quot;"]
  end

  style after stroke:#00c852,stroke-width:3px
  style before stroke:#ff5252,stroke-width:3px
    

A quick overview of how the tokenized version works:

  • You tokenize "Battery Voltage: %d mV" with a macro like PW_TOKENIZE_STRING. You can use pw_log_tokenized to handle the tokenization automatically.

  • After tokenization, "Battery Voltage: %d mV" becomes d9 28 47 8e.

  • The first 4 bytes sent over the wire are the tokenized version of "Battery Voltage: %d mV". The last 2 bytes are the value of voltage converted to a varint using pw_varint.

  • The logs are converted back to the original, human-readable message via the Detokenization API and a token database.
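The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pw_tokenizer or pw_varint API: the token bytes are copied from the example rather than computed, the integer is assumed to be ZigZag-encoded before the varint step (which is consistent with the `aa 3e` bytes shown above), and `DATABASE`, `encode_log`, and `detokenize` are hypothetical names.

```python
# Token bytes copied from the example above; computing the token is not shown.
TOKEN = bytes.fromhex("d928478e")

# Hypothetical one-entry token database: token bytes -> format string.
DATABASE = {TOKEN: "Battery Voltage: %d mV"}


def zigzag(value: int) -> int:
    """Maps signed integers to unsigned ones: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (value << 1) if value >= 0 else ((-value << 1) - 1)


def unzigzag(value: int) -> int:
    """Inverse of zigzag()."""
    return (value >> 1) ^ -(value & 1)


def encode_varint(value: int) -> bytes:
    """Encodes an unsigned integer 7 bits at a time, low bits first."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # High bit set: more bytes follow.
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decodes data that holds exactly one varint."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
    return result


def encode_log(token: bytes, arg: int) -> bytes:
    """Builds the bytes sent over the wire: token, then the encoded argument."""
    return token + encode_varint(zigzag(arg))


def detokenize(message: bytes) -> str:
    """Looks up the format string and fills in the decoded argument."""
    fmt = DATABASE[message[:4]]
    return fmt % unzigzag(decode_varint(message[4:]))


message = encode_log(TOKEN, 3989)
print(message.hex(" "))     # d9 28 47 8e aa 3e
print(detokenize(message))  # Battery Voltage: 3989 mV
```

The value 3989 becomes 7978 after ZigZag encoding. Its low 7 bits (0x2a) are written first with the continuation bit set, giving 0xaa, and the remaining bits give 0x3e, which accounts for the 2 argument bytes in the diagram.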