C/C++ API Reference
UTF-8 helpers

Overview

Basic helpers for reading and writing UTF-8-encoded strings.

Classes

class  pw::utf::CodePointAndSize
 
class  pw::utf8::EncodedCodePoint
 Encapsulates the result of encoding a single code point as UTF-8.
 

Functions

constexpr bool pw::utf::IsValidCodepoint (uint32_t code_point)
 
constexpr bool pw::utf::IsValidCharacter (uint32_t code_point)
 
constexpr pw::Result< utf::CodePointAndSize > pw::utf8::ReadCodePoint (std::string_view str)
 Reads the first code point from a UTF-8 encoded str.
 
constexpr bool pw::utf8::IsStringValid (std::string_view str)
 Determines if str is a valid UTF-8 string.
 
constexpr Result< EncodedCodePoint > pw::utf8::EncodeCodePoint (uint32_t code_point)
 Encodes a single code point as UTF-8.
 
Status pw::utf8::WriteCodePoint (uint32_t code_point, pw::StringBuilder &output)
 Helper that writes a code point to the provided pw::StringBuilder.
 

Function Documentation

◆ EncodeCodePoint()

constexpr Result< EncodedCodePoint > pw::utf8::EncodeCodePoint ( uint32_t  code_point)
constexpr

Encodes a single code point as UTF-8.

UTF-8 encodes each code point in the range [0, 0x10FFFF] as 1 to 4 bytes.

A 1-byte encoding has a top bit of zero:

[0, 0x7F] 1 byte: b0xxx xxxx

An N-byte sequence (N from 2 to 4) is marked in the top N+1 bits of the leading byte: N one bits followed by a zero bit. Each following byte starts with the 2-bit continuation marker 10.

[0x00080, 0x0007FF] 2 bytes: b110x xxxx 10xx xxxx
[0x00800, 0x00FFFF] 3 bytes: b1110 xxxx 10xx xxxx 10xx xxxx
[0x10000, 0x10FFFF] 4 bytes: b1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
Returns
 OK: The code point encoded as UTF-8.
 OUT_OF_RANGE: The code point was not in the valid range for UTF-8 encoding.
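
The byte layouts listed above can be illustrated with a small standalone sketch. This is not the Pigweed implementation: `SketchEncoded` and `SketchEncode` are hypothetical names, and `std::optional` stands in for `pw::Result`, with an empty optional playing the role of OUT_OF_RANGE. It assumes that only the upper bound 0x10FFFF is checked, since that is the only range described here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical result type for this sketch: up to four bytes plus a length.
struct SketchEncoded {
  std::array<std::uint8_t, 4> bytes{};
  std::size_t size = 0;
};

// Encodes one code point using the bit layouts from the table above.
// Returns std::nullopt for values above 0x10FFFF (the OUT_OF_RANGE case).
// Assumption: surrogates are not filtered here; only the range is checked.
constexpr std::optional<SketchEncoded> SketchEncode(std::uint32_t cp) {
  SketchEncoded out{};
  if (cp <= 0x7F) {
    // b0xxx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(cp);
    out.size = 1;
  } else if (cp <= 0x7FF) {
    // b110x xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 2;
  } else if (cp <= 0xFFFF) {
    // b1110 xxxx 10xx xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out.bytes[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 3;
  } else if (cp <= 0x10FFFF) {
    // b1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
    out.bytes[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out.bytes[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 4;
  } else {
    return std::nullopt;
  }
  return out;
}

// U+00E9 falls in the 2-byte range and encodes as C3 A9.
static_assert(SketchEncode(0xE9)->size == 2, "two bytes expected");
static_assert(SketchEncode(0xE9)->bytes[0] == 0xC3, "leading byte");
static_assert(SketchEncode(0xE9)->bytes[1] == 0xA9, "continuation byte");
static_assert(!SketchEncode(0x110000).has_value(), "out of range");
```

The sketch is `constexpr` so that the checks run at compile time, mirroring the `constexpr` qualifier on the documented function.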

◆ IsValidCharacter()

constexpr bool pw::utf::IsValidCharacter ( uint32_t  code_point)
inline constexpr

Checks if the code point is a valid character.

Excludes non-characters (U+FDD0..U+FDEF, and all code points whose low 16 bits are 0xFFFE or 0xFFFF) from the set of valid code points.
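
A minimal standalone sketch of this rule follows. `SketchIsValidCharacter` is a hypothetical name, not the Pigweed function. It assumes that a valid character must first be a valid code point (at most 0x10FFFF and not a surrogate), which is how the description above reads.

```cpp
#include <cstdint>

// Hypothetical re-statement of the non-character rule described above.
constexpr bool SketchIsValidCharacter(std::uint32_t cp) {
  // Assumption: characters are a subset of the valid code point range.
  const bool in_range = cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
  // The contiguous non-character block U+FDD0..U+FDEF.
  const bool in_fdd0_block = cp >= 0xFDD0 && cp <= 0xFDEF;
  // The last two code points of every plane: low 16 bits are FFFE or FFFF.
  const bool plane_tail = (cp & 0xFFFE) == 0xFFFE;
  return in_range && !in_fdd0_block && !plane_tail;
}

static_assert(SketchIsValidCharacter(0x41), "ASCII 'A' is a character");
static_assert(!SketchIsValidCharacter(0xFDD0), "start of non-character block");
static_assert(!SketchIsValidCharacter(0x1FFFE), "plane 1 non-character");
static_assert(SketchIsValidCharacter(0xFDF0), "just past the block");
```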

◆ IsValidCodepoint()

constexpr bool pw::utf::IsValidCodepoint ( uint32_t  code_point)
inline constexpr

Checks if the code point is in a valid range.

Excludes the surrogate code points ([0xD800, 0xDFFF]) and code points larger than 0x10FFFF (the highest code point allowed). Non-characters and unassigned code points are allowed.
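
The range rule can be written out as a short standalone sketch. `SketchIsValidCodepoint` is a hypothetical name used for illustration only and is not the Pigweed function.

```cpp
#include <cstdint>

// Hypothetical re-statement of the range check described above.
constexpr bool SketchIsValidCodepoint(std::uint32_t cp) {
  const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
  return cp <= 0x10FFFF && !is_surrogate;
}

static_assert(SketchIsValidCodepoint(0x41), "ASCII is in range");
static_assert(!SketchIsValidCodepoint(0xD800), "surrogates are excluded");
static_assert(SketchIsValidCodepoint(0xFFFE), "non-characters are allowed");
static_assert(!SketchIsValidCodepoint(0x110000), "above the maximum");
```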

◆ ReadCodePoint()

constexpr pw::Result< utf::CodePointAndSize > pw::utf8::ReadCodePoint ( std::string_view  str)
constexpr

Reads the first code point from a UTF-8 encoded str.

This is a very basic decoder that is not optimized for performance. Validation is limited to checking that the correct number of bytes is available and that each trailing byte of a multibyte sequence carries the continuation marker. See pw::utf8::EncodeCodePoint() for encoding details.

Returns
 OK: The decoded code point and the number of bytes read.
 INVALID_ARGUMENT: The string was empty or malformed.
 OUT_OF_RANGE: The decoded code point was not in the valid range.
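
The decoding steps described above (determine the length from the leading byte, check that enough bytes are available, check each continuation marker) can be sketched standalone as follows. `SketchDecoded` and `SketchRead` are hypothetical names, not the Pigweed API. As a simplification, `std::optional` replaces `pw::Result`, so the INVALID_ARGUMENT and OUT_OF_RANGE cases both collapse into an empty optional. Like the basic validation described above, the sketch does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical result type: the decoded code point and the bytes consumed.
struct SketchDecoded {
  std::uint32_t code_point;
  std::size_t size;
};

constexpr std::optional<SketchDecoded> SketchRead(std::string_view str) {
  if (str.empty()) return std::nullopt;  // empty input

  // The leading byte determines the sequence length and the first payload bits.
  const std::uint8_t lead = static_cast<std::uint8_t>(str[0]);
  std::size_t size = 0;
  std::uint32_t cp = 0;
  if ((lead & 0x80) == 0x00) {
    size = 1;
    cp = lead;
  } else if ((lead & 0xE0) == 0xC0) {
    size = 2;
    cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    size = 3;
    cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    size = 4;
    cp = lead & 0x07;
  } else {
    return std::nullopt;  // stray continuation byte or invalid leading byte
  }

  if (str.size() < size) return std::nullopt;  // truncated sequence

  // Every trailing byte must carry the 10xx xxxx continuation marker.
  for (std::size_t i = 1; i < size; ++i) {
    const std::uint8_t byte = static_cast<std::uint8_t>(str[i]);
    if ((byte & 0xC0) != 0x80) return std::nullopt;
    cp = (cp << 6) | (byte & 0x3F);
  }

  // Range check, standing in for the OUT_OF_RANGE status.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return std::nullopt;
  return SketchDecoded{cp, size};
}

// C3 A9 decodes to U+00E9 and consumes two bytes.
static_assert(SketchRead("\xC3\xA9")->code_point == 0xE9, "decoded value");
static_assert(SketchRead("\xC3\xA9")->size == 2, "bytes consumed");
static_assert(!SketchRead("\xC3"), "truncated sequence is rejected");
```

Returning the consumed size alongside the code point lets a caller advance through a longer string one code point at a time, for example with `str.remove_prefix(result->size)`.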