Pigweed
 
pw::tokenizer::TokenDatabase Class Reference

#include <token_database.h>

Classes

class  Entries
 
struct  Entry
 An entry in the token database. More...
 
class  iterator
 Iterator for TokenDatabase values. More...
 

Public Types

using value_type = Entry
 
using size_type = std::size_t
 
using difference_type = std::ptrdiff_t
 
using reference = value_type &
 
using const_reference = const value_type &
 
using pointer = const value_type *
 
using const_pointer = const value_type *
 
using const_iterator = iterator
 
using reverse_iterator = std::reverse_iterator< iterator >
 
using const_reverse_iterator = std::reverse_iterator< const_iterator >
 

Public Member Functions

constexpr TokenDatabase ()
 Creates a database with no data. ok() returns false.
 
Entries Find (uint32_t token) const
 Returns all entries associated with this token. This is O(n).
 
constexpr size_type size () const
 Returns the total number of entries (unique token-string pairs).
 
constexpr bool ok () const
 
constexpr iterator begin () const
 Returns an iterator for the first token entry.
 
constexpr iterator end () const
 Returns an iterator for one past the last token entry.
 

Static Public Member Functions

template<typename ByteArray >
static constexpr bool IsValid (const ByteArray &bytes)
 
template<const auto & kDatabaseBytes>
static constexpr TokenDatabase Create ()
 
template<typename ByteArray >
static constexpr TokenDatabase Create (const ByteArray &database_bytes)
 

Static Public Attributes

static constexpr uint32_t kDateRemovedNever = 0xFFFFFFFF
 

Detailed Description

Reads entries from a v0 binary token string database. This class does not copy or modify the contents of the database.

The v0 token database has two significant shortcomings:

  • Strings cannot contain null terminators (\0). If a string contains a \0, the database will not work correctly.
  • The domain is not included in entries. All tokens belong to a single domain, which must be known independently.

A v0 binary token database consists of a 16-byte header followed by an array of 8-byte entries and a table of null-terminated strings. The header specifies the number of entries. Each entry contains information about a tokenized string: the token and removal date, if any. All fields are little-endian.

The token removal date is stored within an unsigned 32-bit integer. It is stored as <day> <month> <year>, where <day> and <month> are 1 byte each and <year> is two bytes. The fields are set to their maximum value (0xFF or 0xFFFF) if they are unset. With this format, dates may be compared naturally as unsigned integers.
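The claim that dates compare naturally as unsigned integers can be checked with a small sketch. It assumes the four date bytes are read as a single little-endian 32-bit value, so the day lands in the lowest byte and the year in the upper 16 bits. `PackDate` and `kNever` are hypothetical names used for illustration only; they are not part of pw_tokenizer.

```cpp
#include <cstdint>

// Hypothetical helper: packs a removal date as the entry layout implies when
// bytes 4..7 of an entry are read as one little-endian uint32_t.
// Day is the lowest byte, month the next byte, year the upper 16 bits.
constexpr uint32_t PackDate(uint16_t year, uint8_t month, uint8_t day) {
  return (static_cast<uint32_t>(year) << 16) |
         (static_cast<uint32_t>(month) << 8) |
         static_cast<uint32_t>(day);
}

// Same value as TokenDatabase::kDateRemovedNever.
constexpr uint32_t kNever = 0xFFFFFFFF;

// Later dates are larger integers, because the year is most significant.
static_assert(PackDate(2020, 12, 31) < PackDate(2021, 1, 1));
static_assert(PackDate(2021, 1, 1) < PackDate(2021, 1, 2));

// All-unset fields produce the "never removed" value, which sorts after
// every real date.
static_assert(PackDate(0xFFFF, 0xFF, 0xFF) == kNever);
static_assert(PackDate(2021, 1, 2) < kNever);
```

Because the year occupies the most significant bits, an ordinary `<` on the packed value orders dates chronologically without unpacking them.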

Header (16 bytes)

    Offset  Size  Field
    ======  ====  =========================
         0     6  Magic number (TOKENS)
         6     2  Version (00 00)
         8     4  Entry count
        12     4  Reserved

Entry (8 bytes)

    Offset  Size  Field
    ======  ====  ==================================
         0     4  Token
         4     1  Removal day (1-31, 255 if unset)
         5     1  Removal month (1-12, 255 if unset)
         6     2  Removal year (65535 if unset)
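As an illustration of the header layout, the following sketch reads the magic number, version, and entry count from a raw byte buffer. It does not use the TokenDatabase class. `ReadU32LE`, `HeaderInfo`, and `ParseHeader` are hypothetical names introduced here, and the real implementation may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Reads an unsigned little-endian 32-bit integer from four bytes.
inline uint32_t ReadU32LE(const unsigned char* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Hypothetical struct holding the header fields of interest.
struct HeaderInfo {
  uint16_t version;
  uint32_t entry_count;
};

// Returns the header fields, or std::nullopt if the buffer is shorter than
// the 16-byte header or does not start with the "TOKENS" magic number.
inline std::optional<HeaderInfo> ParseHeader(const unsigned char* data,
                                             std::size_t size) {
  constexpr std::size_t kHeaderSize = 16;
  if (size < kHeaderSize || std::memcmp(data, "TOKENS", 6) != 0) {
    return std::nullopt;
  }
  // Version is 2 bytes at offset 6; entry count is 4 bytes at offset 8.
  const uint16_t version = static_cast<uint16_t>(data[6] | (data[7] << 8));
  return HeaderInfo{version, ReadU32LE(data + 8)};
}
```

The reserved field at offset 12 is skipped, since the description above assigns it no meaning.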

Entries are sorted by token. A string table with a null-terminated string for each entry in order follows the entries.

Entries are accessed by iterating over the database. An O(n) Find function is also provided. In typical use, a TokenDatabase is preprocessed by a pw::tokenizer::Detokenizer into a std::unordered_map.
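The relationship between the entry array and the string table can be sketched without the TokenDatabase class. The code below walks both in lockstep over a raw buffer and then performs a linear search, which is why a lookup of this kind is O(n). `DecodedEntry`, `DecodeAll`, and `FindStrings` are hypothetical names; the real `Entry` struct and iterator may differ, and the sketch assumes the buffer has already been validated.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical decoded form of one database entry.
struct DecodedEntry {
  uint32_t token;
  uint32_t date_removed;
  std::string text;
};

// Reads an unsigned little-endian 32-bit integer from four bytes.
inline uint32_t ReadU32LE(const unsigned char* p) {
  return static_cast<uint32_t>(p[0]) | (static_cast<uint32_t>(p[1]) << 8) |
         (static_cast<uint32_t>(p[2]) << 16) |
         (static_cast<uint32_t>(p[3]) << 24);
}

// Walks the 8-byte entries and the string table together. The i-th string in
// the table belongs to the i-th entry. Assumes `data` is a valid database.
inline std::vector<DecodedEntry> DecodeAll(const unsigned char* data) {
  const uint32_t count = ReadU32LE(data + 8);
  const unsigned char* entry = data + 16;  // Entries follow the header.
  const char* str =
      reinterpret_cast<const char*>(entry + 8 * std::size_t{count});
  std::vector<DecodedEntry> out;
  for (uint32_t i = 0; i < count; ++i, entry += 8) {
    out.push_back({ReadU32LE(entry), ReadU32LE(entry + 4), std::string(str)});
    str += std::strlen(str) + 1;  // Skip past the null terminator.
  }
  return out;
}

// Linear search: collects every string whose entry has the given token.
inline std::vector<std::string> FindStrings(
    const std::vector<DecodedEntry>& entries, uint32_t token) {
  std::vector<std::string> matches;
  for (const DecodedEntry& e : entries) {
    if (e.token == token) {
      matches.push_back(e.text);
    }
  }
  return matches;
}
```

Because string lengths vary, the string for entry i can only be located by skipping the i-1 strings before it. That is consistent with the description above of access by iteration rather than by index.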

Member Function Documentation

◆ Create() [1/2]

template<const auto & kDatabaseBytes>
static constexpr TokenDatabase pw::tokenizer::TokenDatabase::Create ( )
inline static constexpr

Creates a TokenDatabase and checks if the provided data is valid at compile time. Accepts references to constexpr containers (array, span, string_view, etc.) with static storage duration. For example:

constexpr char kMyData[] = ...;
constexpr TokenDatabase db = TokenDatabase::Create<kMyData>();

◆ Create() [2/2]

template<typename ByteArray >
static constexpr TokenDatabase pw::tokenizer::TokenDatabase::Create ( const ByteArray &  database_bytes)
inline static constexpr

Creates a TokenDatabase from the provided byte array. The array may be a span, array, or other container type. If the data is not valid, returns a default-constructed database for which ok() is false.

Prefer the Create overload that takes the data as a template parameter when possible, since that overload verifies data integrity at compile time.

◆ IsValid()

template<typename ByteArray >
static constexpr bool pw::tokenizer::TokenDatabase::IsValid ( const ByteArray &  bytes)
inline static constexpr

Returns true if the provided data is a valid token database. This checks the magic number (TOKENS), the version (which must be 0), and that there is one string for each entry in the database. A database with extra strings or other trailing data is considered valid.
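The three checks described above can be sketched over a raw buffer as follows. This is an illustrative approximation, not the library's implementation. `LooksValid` is a hypothetical name, and counting null terminators is one assumed way of confirming that each entry has a string.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical validity check mirroring the documented rules: correct magic
// number, version 0, and at least one null-terminated string per entry.
inline bool LooksValid(const unsigned char* data, std::size_t size) {
  constexpr std::size_t kHeaderSize = 16;
  constexpr std::size_t kEntrySize = 8;

  if (size < kHeaderSize) return false;
  if (std::memcmp(data, "TOKENS", 6) != 0) return false;  // Magic number.
  if (data[6] != 0 || data[7] != 0) return false;         // Version must be 0.

  // Little-endian entry count at offset 8.
  const uint32_t count = static_cast<uint32_t>(data[8]) |
                         (static_cast<uint32_t>(data[9]) << 8) |
                         (static_cast<uint32_t>(data[10]) << 16) |
                         (static_cast<uint32_t>(data[11]) << 24);

  // Use 64-bit math so a huge entry count cannot overflow the offset.
  const uint64_t table_start = kHeaderSize + kEntrySize * uint64_t{count};
  if (table_start > size) return false;

  // Each complete string ends in a null byte, so count terminators.
  uint64_t terminators = 0;
  for (std::size_t i = static_cast<std::size_t>(table_start); i < size; ++i) {
    if (data[i] == 0) ++terminators;
  }
  // Extra strings or trailing data are tolerated, as documented.
  return terminators >= count;
}
```

The final comparison uses `>=` rather than `==` so that a buffer with extra strings or trailing bytes still passes, matching the documented behavior.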

◆ ok()

constexpr bool pw::tokenizer::TokenDatabase::ok ( ) const
inline constexpr

True if this database was constructed with valid data. The database might be empty, but it has an intact header and a string for each entry.

Member Data Documentation

◆ kDateRemovedNever

constexpr uint32_t pw::tokenizer::TokenDatabase::kDateRemovedNever = 0xFFFFFFFF
static constexpr

Default date_removed for an entry in the token database if it was never removed.


The documentation for this class was generated from the following file: