Overview

Basic helpers for reading and writing UTF-8-encoded strings.

Classes
class	pw::utf::CodePointAndSize

class	pw::utf8::EncodedCodePoint
	Encapsulates the result of encoding a single code point as UTF-8.

Functions
constexpr bool	pw::utf::IsValidCodepoint (uint32_t code_point)

constexpr bool	pw::utf::IsValidCharacter (uint32_t code_point)

constexpr pw::Result< utf::CodePointAndSize >	pw::utf8::ReadCodePoint (std::string_view str)
	Reads the first code point from a UTF-8 encoded `str`.

constexpr bool	pw::utf8::IsStringValid (std::string_view str)
	Determines if `str` is a valid UTF-8 string.

constexpr Result< EncodedCodePoint >	pw::utf8::EncodeCodePoint (uint32_t code_point)
	Encodes a single code point as UTF-8.

Status	pw::utf8::WriteCodePoint (uint32_t code_point, pw::StringBuilder &output)
	Helper that writes a code point to the provided `pw::StringBuilder`.

Function Documentation

◆ EncodeCodePoint()

constexpr Result< EncodedCodePoint > pw::utf8::EncodeCodePoint ( uint32_t code_point )

constexpr

Encodes a single code point as UTF-8.

UTF-8 encodes each code point in the range [0, 0x10FFFF] as 1 to 4 bytes.

1-byte encoding has a top bit of zero:

[0, 0x7F] 1-bytes: b0xxx xxxx

An N-byte sequence (N = 2 to 4) is identified by the top N+1 bits of its leading byte: N one bits followed by a zero bit. Each following byte starts with the 2-bit continuation marker 10.

[0x00080, 0x0007FF] 2-bytes: b110x xxxx 10xx xxxx
[0x00800, 0x00FFFF] 3-bytes: b1110 xxxx 10xx xxxx 10xx xxxx
[0x10000, 0x10FFFF] 4-bytes: b1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
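
The bit layouts above can be sketched as a small standalone encoder. This is an illustration only, not the Pigweed implementation: `EncodedSketch` and `EncodeSketch` are hypothetical names, the result type is a stand-in for `EncodedCodePoint`, and the sketch checks only the upper bound 0x10FFFF (this section does not say how the real function treats surrogates).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for illustration; not pw::utf8::EncodedCodePoint.
struct EncodedSketch {
  std::uint8_t bytes[4] = {0, 0, 0, 0};
  std::size_t size = 0;  // 0 means the input was above 0x10FFFF.
};

// Hypothetical encoder that follows the bit layouts listed above.
constexpr EncodedSketch EncodeSketch(std::uint32_t cp) {
  EncodedSketch out{};
  if (cp <= 0x7F) {
    // 0xxx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(cp);
    out.size = 1;
  } else if (cp <= 0x7FF) {
    // 110x xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xC0 | (cp >> 6));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 2;
  } else if (cp <= 0xFFFF) {
    // 1110 xxxx 10xx xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xE0 | (cp >> 12));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out.bytes[2] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 3;
  } else if (cp <= 0x10FFFF) {
    // 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    out.bytes[0] = static_cast<std::uint8_t>(0xF0 | (cp >> 18));
    out.bytes[1] = static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F));
    out.bytes[2] = static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F));
    out.bytes[3] = static_cast<std::uint8_t>(0x80 | (cp & 0x3F));
    out.size = 4;
  }
  return out;
}

// U+20AC (euro sign) needs three bytes: E2 82 AC.
static_assert(EncodeSketch(0x20AC).size == 3, "3-byte range");
static_assert(EncodeSketch(0x20AC).bytes[0] == 0xE2, "leading byte");
```

Returning a fixed 4-byte buffer plus a size keeps the sketch usable in constant expressions without any allocation.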

Returns

OK: The code point encoded as UTF-8.

OUT_OF_RANGE: The code point was not in the valid range for UTF-8 encoding.

◆ IsValidCharacter()

constexpr bool pw::utf::IsValidCharacter ( uint32_t code_point )

inline constexpr

Checks if the code point is a valid character.

Excludes non-characters (U+FDD0..U+FDEF, and all code points whose low 16 bits are 0xFFFE or 0xFFFF) from the set of valid code points.
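
The non-character rule can be expressed as a short predicate. The sketch below is an assumption about how such a test might look, not the library's code, and `IsNonCharacterSketch` is a hypothetical name. A full character check would combine it with a range check on the code point.

```cpp
#include <cstdint>

// Hypothetical helper: true for the non-characters described above.
constexpr bool IsNonCharacterSketch(std::uint32_t cp) {
  // The contiguous block U+FDD0..U+FDEF.
  if (cp >= 0xFDD0 && cp <= 0xFDEF) {
    return true;
  }
  // The last two code points of every plane: low 16 bits are 0xFFFE or
  // 0xFFFF. Masking off the lowest bit matches both values at once.
  return (cp & 0xFFFE) == 0xFFFE;
}

static_assert(IsNonCharacterSketch(0xFDD0), "start of the FDD0 block");
static_assert(IsNonCharacterSketch(0x1FFFE), "end of plane 1");
static_assert(!IsNonCharacterSketch(0xFFFD), "replacement character is fine");
```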

◆ IsValidCodepoint()

constexpr bool pw::utf::IsValidCodepoint ( uint32_t code_point )

inline constexpr

Checks if the code point is in a valid range.

Excludes the surrogate code points ([0xD800, 0xDFFF]) and codepoints larger than 0x10FFFF (the highest codepoint allowed). Non-characters and unassigned codepoints are allowed.
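
A minimal sketch of the range check described above, assuming it reduces to an upper bound plus a surrogate exclusion. `IsValidCodepointSketch` is a hypothetical name used only for illustration.

```cpp
#include <cstdint>

// Hypothetical range check: at most 0x10FFFF and not a surrogate.
constexpr bool IsValidCodepointSketch(std::uint32_t cp) {
  const bool is_surrogate = cp >= 0xD800 && cp <= 0xDFFF;
  return cp <= 0x10FFFF && !is_surrogate;
}

static_assert(!IsValidCodepointSketch(0xD800), "surrogates are excluded");
static_assert(!IsValidCodepointSketch(0x110000), "above the maximum");
// Non-characters such as U+FFFE still pass this check.
static_assert(IsValidCodepointSketch(0xFFFE), "non-characters are allowed");
```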

◆ ReadCodePoint()

constexpr pw::Result< utf::CodePointAndSize > pw::utf8::ReadCodePoint ( std::string_view str )

constexpr

Reads the first code point from a UTF-8 encoded str.

This is a very basic decoder that is not optimized for performance. Validation is limited to checking that the correct number of bytes is available and that each trailing byte of a multibyte sequence carries the continuation marker. See pw::utf8::EncodeCodePoint() for encoding details.
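
The two checks mentioned above (enough bytes, and a continuation marker on each trailing byte) can be sketched as follows. This is a standalone illustration under stated assumptions, not the Pigweed decoder: `DecodedSketch` and `ReadSketch` are hypothetical names, a size of 0 stands in for an error status, and the sketch does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical result type; a stand-in for a code point plus its byte count.
struct DecodedSketch {
  std::uint32_t code_point = 0;
  std::size_t size = 0;  // Bytes consumed; 0 signals a failed read.
};

// Hypothetical decoder for the first code point of `str`.
constexpr DecodedSketch ReadSketch(std::string_view str) {
  if (str.empty()) {
    return {};
  }
  const auto lead = static_cast<std::uint8_t>(str[0]);
  std::size_t size = 0;
  std::uint32_t cp = 0;
  // The leading byte determines the sequence length and the first payload bits.
  if ((lead & 0x80) == 0x00) {
    size = 1;
    cp = lead;
  } else if ((lead & 0xE0) == 0xC0) {
    size = 2;
    cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    size = 3;
    cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    size = 4;
    cp = lead & 0x07;
  } else {
    return {};  // A continuation byte or an invalid leading byte.
  }
  if (str.size() < size) {
    return {};  // Not enough bytes available.
  }
  for (std::size_t i = 1; i < size; ++i) {
    const auto byte = static_cast<std::uint8_t>(str[i]);
    if ((byte & 0xC0) != 0x80) {
      return {};  // Missing the 10xx xxxx continuation marker.
    }
    cp = (cp << 6) | (byte & 0x3F);
  }
  return {cp, size};
}

// E2 82 AC decodes to U+20AC and consumes three bytes.
static_assert(ReadSketch("\xE2\x82\xAC").code_point == 0x20AC, "decode");
static_assert(ReadSketch("\xE2\x82\xAC").size == 3, "bytes consumed");
```

The returned size lets a caller advance through a longer string one code point at a time.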