Reads entries from a v0 binary token string database. This class does not copy or modify the contents of the database.
The v0 token database has two significant shortcomings:
- Strings cannot contain null terminators (``\0``). If a string contains a ``\0``, the database will not work correctly.
- The domain is not included in entries. All tokens belong to a single domain, which must be known independently.
A v0 binary token database consists of a 16-byte header followed by an array of 8-byte entries and a table of null-terminated strings. The header specifies the number of entries. Each entry contains information about a tokenized string: the token and the removal date, if any. All fields are little-endian.
The token removal date is stored within an unsigned 32-bit integer. It is stored as ``<day> <month> <year>``, where ``<day>`` and ``<month>`` are 1 byte each and ``<year>`` is two bytes. The fields are set to their maximum value (``0xFF`` or ``0xFFFF``) if they are unset. Because the integer is little-endian, the year occupies its most significant bytes, followed by the month and then the day, so dates may be compared naturally as unsigned integers.
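The date encoding can be illustrated with a minimal sketch. ``PackRemovalDate`` and ``kDateUnset`` are hypothetical names introduced only for this example and are not part of the documented API. The sketch assumes the four date bytes are read as one little-endian 32-bit integer, as described above.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the documented API). Builds the value a
// little-endian reader obtains from the 4 bytes <day> <month> <year>: the day
// is the least significant byte and the year is the most significant 16 bits.
constexpr uint32_t PackRemovalDate(uint8_t day, uint8_t month, uint16_t year) {
  return static_cast<uint32_t>(day) | (static_cast<uint32_t>(month) << 8) |
         (static_cast<uint32_t>(year) << 16);
}

// Every field unset (0xFF, 0xFF, 0xFFFF) yields the largest possible value.
constexpr uint32_t kDateUnset = PackRemovalDate(0xFF, 0xFF, 0xFFFF);

static_assert(kDateUnset == 0xFFFFFFFFu, "unset date is all ones");

// Later dates compare greater: the year dominates, then the month, then the day.
static_assert(PackRemovalDate(31, 12, 2019) < PackRemovalDate(1, 1, 2020),
              "year dominates");
static_assert(PackRemovalDate(28, 2, 2020) < PackRemovalDate(1, 3, 2020),
              "month dominates day");
static_assert(PackRemovalDate(1, 2, 2020) < PackRemovalDate(28, 2, 2020),
              "day breaks ties");
static_assert(PackRemovalDate(1, 1, 2020) < kDateUnset,
              "an unset date sorts after any real date");
```

One consequence of this layout is that an entry with no removal date compares greater than any entry that has one.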
====== ==== =========================
Header (16 bytes)
-------------------------------------
Offset Size Field
====== ==== =========================
     0    6 Magic number (``TOKENS``)
     6    2 Version (``00 00``)
     8    4 Entry count
    12    4 Reserved
====== ==== =========================

====== ==== ==================================
Entry (8 bytes)
----------------------------------------------
Offset Size Field
====== ==== ==================================
     0    4 Token
     4    1 Removal day (1-31, 255 if unset)
     5    1 Removal month (1-12, 255 if unset)
     6    2 Removal year (65535 if unset)
====== ==== ==================================
Entries are sorted by token. A string table with a null-terminated string for each entry in order follows the entries.
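The layout above can be walked with a short parser. This is a sketch under stated assumptions, not the class's implementation: ``ParsedEntry``, ``ReadLe32``, and ``ParseV0Database`` are hypothetical names, and unlike the class described here, the sketch copies each string for simplicity.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical type for illustration; this is not the TokenDatabase API.
struct ParsedEntry {
  uint32_t token;
  uint32_t date_removed;  // 0xFFFFFFFF when all date fields are unset
  std::string string;     // Copied here; the real class does not copy.
};

// Reads a little-endian 32-bit value byte by byte, so the result does not
// depend on the host's endianness or on alignment.
inline uint32_t ReadLe32(const uint8_t* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Parses a v0 database into `out`. Returns false if the data is malformed.
inline bool ParseV0Database(const uint8_t* data, size_t size,
                            std::vector<ParsedEntry>& out) {
  constexpr size_t kHeaderSize = 16;
  constexpr size_t kEntrySize = 8;
  // Magic number "TOKENS" followed by the two version bytes 00 00.
  static const uint8_t kMagicAndVersion[8] = {'T', 'O', 'K', 'E',
                                              'N', 'S', 0,   0};
  out.clear();
  if (size < kHeaderSize || std::memcmp(data, kMagicAndVersion, 8) != 0) {
    return false;
  }

  // The entry count is at offset 8; the entry array must fit in the buffer.
  const uint32_t count = ReadLe32(data + 8);
  if (count > (size - kHeaderSize) / kEntrySize) {
    return false;
  }

  // The string table starts immediately after the last entry.
  const uint8_t* entry = data + kHeaderSize;
  size_t str_pos = kHeaderSize + static_cast<size_t>(count) * kEntrySize;

  for (uint32_t i = 0; i < count; ++i, entry += kEntrySize) {
    // The i-th string belongs to the i-th entry and ends at the next \0.
    const void* end = std::memchr(data + str_pos, '\0', size - str_pos);
    if (end == nullptr) {
      out.clear();
      return false;
    }
    const size_t len = static_cast<size_t>(
        static_cast<const uint8_t*>(end) - (data + str_pos));
    out.push_back(ParsedEntry{
        ReadLe32(entry), ReadLe32(entry + 4),
        std::string(reinterpret_cast<const char*>(data + str_pos), len)});
    str_pos += len + 1;  // Skip the string and its terminator.
  }
  return true;
}
```

Because strings are located only by scanning for the next ``\0``, a string that itself contains a ``\0`` would shift every later string by one entry, which is the first shortcoming listed above.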
Entries are accessed by iterating over the database. An O(n) ``Find`` function is also provided. In typical use, a ``TokenDatabase`` is preprocessed by a ``pw::tokenizer::Detokenizer`` into a ``std::unordered_map``.
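The trade-off between a linear search and a preprocessed map can be sketched as follows. ``Entry``, ``LinearFind``, ``Lookup``, and ``BuildLookup`` are hypothetical names for this example only. How the ``Detokenizer`` actually builds its map is not shown in this section, and the sketch assumes that more than one entry may share a token.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified entry; not the TokenDatabase entry type.
struct Entry {
  uint32_t token;
  std::string string;
};

// O(n) lookup: scans the entries and returns the first match, or nullptr.
inline const Entry* LinearFind(const std::vector<Entry>& entries,
                               uint32_t token) {
  for (const Entry& entry : entries) {
    if (entry.token == token) {
      return &entry;
    }
  }
  return nullptr;
}

// Assumption: tokens are not guaranteed unique, so each token maps to a list.
using Lookup = std::unordered_map<uint32_t, std::vector<std::string>>;

// One O(n) pass up front gives average O(1) lookups afterwards.
inline Lookup BuildLookup(const std::vector<Entry>& entries) {
  Lookup lookup;
  lookup.reserve(entries.size());
  for (const Entry& entry : entries) {
    lookup[entry.token].push_back(entry.string);
  }
  return lookup;
}
```

A linear search is adequate for a one-off lookup. When many tokens must be resolved, building the map once amortizes the cost of the scan.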