On-disk format and design#
The pw_kvs
module stores data as a log of entries written sequentially into
flash sectors. This append-only design enables safe recovery from unexpected
power loss. The system is designed primarily for
NOR flash,
which is common in embedded systems. It does not include features required for
NAND flash, such as bad block management, as a dedicated flash translation
layer (FTL) typically handles these.
High-level structure: a log of entries in sectors#
pw_kvs
divides flash memory into sectors. The KVS operates as a
log-structured file system, meaning pw_kvs
always appends data to the log;
it never modifies data in place.
pw_kvs
writes entries one after another into the current active sector.
When updating a key, pw_kvs
writes a new entry containing the new value to
the end of the log. pw_kvs
considers the previous entry for that key
“stale.” pw_kvs
handles deletes similarly, by writing a special
“tombstone” entry.
This append-only approach suits NOR flash memory, which allows writing in small increments but requires erasing in larger blocks. Garbage collection is triggered during maintenance operations to reclaim space from stale entries.
Low-level structure: the entry format#
pw_kvs
stores each piece of data in an “entry” packet. This packet
contains metadata alongside the key and value. The structure is compact and
efficient for embedded systems.
Field |
Size |
Description |
---|---|---|
Magic |
32-bit |
Identifier that marks the beginning of a valid entry and indicates the data format version. |
Checksum |
Variable |
Checksum (e.g. CRC16) of the entry’s contents, which verifies data
integrity. The |
Alignment |
8-bit |
An 8-bit field (alignment_units) used to calculate the entry’s
alignment. The alignment in bytes is |
Key Length |
8-bit |
Length of the key string in bytes. |
Value Size |
16-bit |
Size of the value in bytes. A special value of |
Transaction ID |
32-bit |
Monotonically increasing number that orders the entries. |
Key |
Variable |
Key string, with a variable length up to 255 bytes. It is not null-terminated. |
Value |
Variable |
Data associated with the key. |
Padding |
Variable |
Zero-bytes added to the end of the entry so that its total size is a multiple of the entry’s calculated alignment. |
Design principle: no global metadata#
Many storage systems maintain a central block of metadata (e.g., allocation tables, journals) that contains information about the overall state of the system. While this can offer performance benefits, it also introduces a single point of failure.
pw_kvs
avoids this by deriving its state entirely from the individual
key-value entries stored in flash. This design has several advantages:
Robustness: No single block of data can be corrupted and render the entire KVS unusable. Also, no single block requires repeated modification, which would lead to flash degradation.
Simplicity: This simplifies KVS logic, since no complicated invariants exist between the entries and a global metadata block.
With this approach, pw_kvs
initializes by scanning the entire flash
partition and reading all valid entries. The tradeoff for avoiding global
metadata is that the KVS can’t easily support multi-key transactions, where
pw_kvs
modifies values for multiple keys, and records either all updates
(in a successful write) or none (in a data loss event). This tradeoff is
sufficient for common embedded use cases.
Note
Depending on the size of the flash partition and the number of entries, KVS initialization can take a significant amount of time. Regularly running maintenance operations (GC) will help keep the KVS responsive.
Invariants#
The KVS maintains the following invariants.
One free sector for garbage collection: The KVS requires at least one free (erased) sector at all times to guarantee that garbage collection can always proceed.
Entries do not cross sector boundaries: An entry must be fully contained within a single sector. This is a direct consequence of the physical constraints of flash hardware, which can only be erased in fixed-size blocks (sectors). If an entry spanned two sectors, erasing one would corrupt the entry.
Self-contained entries: Every entry is a complete, self-describing record. It contains all the information needed to interpret its own data, including its size (via alignment), key length, and value size.
Redundant copies in different sectors: The KVS maintains a configured number of copies (replicas) for each logical key-value entry. Each of these entries is bit-for-bit identical, including the key, value, and transaction ID. As a placement heuristic,
pw_kvs
stores each copy in a different flash sector to protect against sector-level corruption. If a corruption event or flash failure leads to the loss of one or more copies, the KVS detects this deficit and restores the configured level of redundancy.Highest transaction ID is newest: For a given key, the entry with the highest transaction ID is the most recent one.
Implementation details#
Alignment, padding, and forward compatibility#
The Alignment
and Padding
fields work together to ensure that
pw_kvs
stores each entry correctly on flash memory, which often has
specific alignment requirements for writes.
pw_kvs
stores the required alignment within each entry’s header. This
8-bit value isn’t the alignment itself, but an alignment_units
value used
to calculate byte alignment with the formula (alignment_units + 1) * 16
.
Advantages of this approach:
Flexibility: It supports alignments from 16 bytes (for
alignment_units = 0
) to 4096 bytes (foralignment_units = 255
), which is sufficient for most flash hardware.Space efficiency: Storing a single 8-bit integer is more compact than storing a 16-bit or 32-bit alignment value.
Forward compatibility: Because each entry is self-describing, a future firmware update can change the alignment for new entries. The new firmware can still correctly calculate the size of older entries. When moving an old entry during garbage collection,
pw_kvs
rewrites it with the new, larger alignment, upgrading the data in place over time.
Transaction ID rollover#
To identify the most recent version of a key, the KVS uses a 32-bit transaction ID incremented on every write.
By design, the KVS does not handle the rollover of this transaction ID. This is a practical trade-off based on the lifecycle of typical flash hardware. A 32-bit transaction ID provides approximately 4.3 billion unique IDs. In contrast, a standard NOR flash sector has a write/erase endurance of around 100,000 cycles per sector.
Because pw_kvs
distributes writes across all available sectors for
wear-leveling, physical flash memory will likely wear out long before the
transaction ID space is exhausted. For most embedded applications, this makes
transaction ID rollover a theoretical concern rather than a practical one.