Pigweed Blog #4: Fixed-point arithmetic as a replacement for soft floats#

Published on 2024-09-03 by Leonard Chan

Fixed-point arithmetic is an alternative to floating-point arithmetic for representing fractional data values (0.5, -1.125, 10.75, etc.). Fixed-point types can be represented as scaled integers. The advantage here is many arithmetic operations (+, -, *, /, etc.) can be implemented as normal integral instructions. This can be useful for embedded systems or other applications where floating-point operations can be too expensive or not available. Clang has support for fixed-point types according to ISO TR 18037. I work on a Google project that uses Pigweed. Recently, we replaced usage of floats in an embedded project running on Cortex M0+. This MCU doesn’t provide any hardware support for floating-point operations, so floating-point operations are implemented in software. We’ve observed a ~2x speed improvement in classification algorithms and a small decrease in binary size using fixed-point without sacrificing correctness.

All Clang users can benefit from this by adding -ffixed-point as a compilation flag. All of the types and semantics described in ISO 18037 are supported.

Problems with (soft) floats#

One of the ways to represent a decimal number is to use a significand, base, and exponent. This is how floating-point numbers according to IEEE 754 are represented. The base is two, but the significand and exponent are adjustable.

When doing arithmetic with floats, many operations are needed to align the exponents of the operands and normalize the result. Most modern computers have Floating-Point Units (FPUs) that do this arithmetic very quickly, but not all platforms (especially embedded ones) have this luxury. The alternative is to emulate floating-point arithmetic in software which can be costly.

Intro to fixed-point types and scaled integers#

ISO TR 18037 describes 12 fixed-point types with varying scales and ranges:

unsigned short _Fract
unsigned _Fract
unsigned long _Fract
signed short _Fract
signed _Fract
signed long _Fract
unsigned short _Accum
unsigned _Accum
unsigned long _Accum
signed short _Accum
signed _Accum
signed long _Accum

The scale represents the number of fractional bits in the type. Under the hood, Clang implements each of these types as integers. To get the decimal value of the type, we just divide the integer by the scale. For example, on x86_64 Linux, an unsigned _Accum type is represented as an unsigned 32-bit integer with a scale of 16, so 16 bits are used to represent the fractional part and the remaining 16 are used to represent the integral part. An unsigned _Accum type with a value of 16.625 would be represented as 1089536 because 1089536 / 2**16 = 16.625. Additionally, an unsigned _Accum type has a range of [0, 65535.99998474121], where the max value is represented as 2^32-1. The smallest possible value we can represent with a fixed-point type is 1/2^scale.

The width and scale of each type are platform-dependent. In Clang, different targets are free to provide whatever widths and scales are best-suited to their needs and capabilities. Likewise, LLVM backends are free to lower the different fixed-point intrinsics to native fixed-point operations. The default configuration for all targets is the “typical desktop processor” implementation described in A.3 of ISO TR 18037.

Each of these types has a saturating equivalent also (denoted by the _Sat keyword). This means operations on a saturating type that would normally cause overflow instead clamps to the maximal or minimal value for that type.

Fixed-point to the rescue#

One of the main advantages of fixed-point types is their operations can be done with normal integer operations. Addition and subtraction between the same fixed-point types normally require just a regular integral add or sub instruction. Multiplication and division can normally be done with a regular mul and div instruction (plus an extra shift to account for the scale). Platforms that don’t support hard floats normally need to make a call to a library function that implements the floating-point equivalents of these which can be large or CPU-intensive.

Fixed-point types however are limited in their range and precision which are dependent on the number of integral and fractional bits. A signed _Accum (which uses 15 fractional bits, 16 integral bits, and 1 sign bit on x86_64 Linux) will have a range of [-65536, 65535.99996948242] and a precision of 3.0517578125e-05. Floats can effectively range from (-∞, +∞) and have precision as small as \(10^{-38}\). Maintaining correctness for your program largely depends on how much range and precision you intend your fractional numbers to have. For addressing ranges, there are sometimes a couple of clever math tricks you can do to get large values to fit within these ranges. Some of these methods will be discussed later.

Fixed-Point vs Floating-Point#
	Fixed-Point	Floating-Point
Range	Limited by integral and fractional bits	+/-∞
Precision	Limited by the fractional bits (\(1/2^{scale}\))	\(10^{-38}\)
Binary Operations	Typically fast Usually just involves normal integral binary ops with some shifts	Hard floats - Typically fast Soft floats - Can be very slow/complex; depends on the implementation
Correctness	Always within 1 ULP of the true result Will always produce the same result for a given op	Always within 0.5 ULP of true result May not always produce the same result for a given op due to rounding errors

Running on Cortex-M0+#

The Arm Cortex-M0+ processor is a common choice for constrained embedded applications because it is very energy-efficient. Because it doesn’t have an FPU, many floating-point operations targeting this CPU depend on a helper library to define these as runtime functions. For our application, the code running on this processor runs different classification algorithms. One such classifier utilizes softmax regression. The algorithm is roughly:

// This is roughly a trimmed down version of the softmax regression
// classifier.
size_t Classify(std::span<const float> sample) const {
    // 1. Compute activations for each category.
    //    All activations are initially zero.
    std::array<float, NumCategories> activations{};
    for (size_t i = 0; i < NumCategories; i++) {
        for (size_t j = 0; j < sample.size(); j++) {
            activations[i] += coefficients_[i][j] * sample[j];
        }
        activations[i] += biases_[i];
    }
    float max_activation = *std::max_element(activations.begin(),
                                             activations.end());

    // 2. Get e raised to each of these activations.
    std::array<T, NumCategories> exp_activations;
    for (size_t i = 0; i < NumCategories; i++) {
        exp_activations[i] = expf(activations[i]);
    }
    float sum_exp_activations = std::accumulate(exp_activations.begin(),
                                                exp_activations.end(), 0);

    // 3. Compute each likelihood which us exp(x[i]) / sum(x).
    float max_likelihood = std::numeric_limits<float>::min();
    size_t prediction;
    for (size_t i = 0; i < NumCategories; i++) {
        // 0 <= result.likelihoods[i] < 1
        result.likelihoods[i] = exp_activations[i] / sum_exp_activations;
        if (result.likelihoods[i] > max_likelihood) {
            max_likelihood = result.likelihoods[i];
            prediction = i;  // The highest likelihood is the prediction.
        }
    }

    return prediction;  // Return the index of the highest likelihood.
}

Many of these operations involve normal floating-point comparison, addition, multiplication, and division. Each of these expands to a call to some __aeabi_* function. This particular function is on a hot path that activates, meaning soft float operations take up much computation and power. For the normal binary operations, fixed-point types might be a good fit because it’s very likely each of the binary operations will result in a value that can fit into one of the fixed-point types. (Each of the sample values is in the range [-256, +256] and there are at most 10 categories. Likewise, each of the coefficients_ and biases_ are small values ranging from roughly (-3, +3).)

If we wanted to completely replace floats for this function with fixed-point types, we’d run into two issues:

There doesn’t exist an expf function that operates on fixed-point types.
expf(x) can grow very large for even small for small values of x (expf(23) would exceed the largest value possible for an unsigned long _Accum).

Fortunately, we can refactor the code as needed and we can make changes to llvm-libc, the libc implementation this device uses.

llvm-libc and some math#

The embedded device’s codebase is dependent on a handful of functions that take floats: expf, sqrtf, and roundf. printf is also used for printing floats, but support for the fixed-point format specifiers is needed. The project already uses llvm-libc, so we can implement versions of these functions that operate on fixed-point types. The llvm-libc team at Google has been very responsive and eager to implement these functions for us! Michael Jones (who implemented much of printf in llvm-libc) provided printf support for each of the fixed point types. Tue Ly provided implementations for abs, roundf, sqrtf, integral sqrt with a fixed-point output, and expf for different fixed-point types. Now we have the necessary math functions which accept fixed-point types.

With implementations provided, the next big step is avoiding overflow and getting results to fit within the new types. Let’s look at one case involving expf and one involving sqrtf. (Tue Ly also provided these solutions.)

The softmax algorithm above easily causes overflow with:

exp_activations[i] = expf(activations[i]);

But the larger goal is to get the likelihood that a sample matches each category. This is a value from [0, 1].

\[likelihood(i) = \frac{e^{activations[i]}}{\sum_{k=0}^{NumCategories}{e^{activation[k]}}} = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))}\]

This can however be normalized by the max activation:

\[likelihood(i) = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))} = \frac{e^{activation[i]]}}{sum(e^{activation[...]}))} * \frac{max(e^{activation[...]})}{max(e^{activation[...]})} = \frac{e^{activation[i]-max(activation[...])]}}{sum(e^{activation[...]-max(activation[...])}))}\]

This makes the exponent for each component at most zero and the result of exp(x) at most 1 which can definitely fit in the fixed-point types. Likewise, the sum of each of these is at most NumCategories (which happens to be 10 for us). For the above code, the only necessary change required is:

exp_activations[i] = expf(activations[i] - max_activation);

And lucky for us, the precision for a signed _Accum type is enough to get us the same results. One interesting thing is this change alone also improved the performance when using floats by 11%. The theory is the expf implementation has a quick switch to see if the exponents need to be adjusted for floats, and skips that scaling part when unnecessary.

The code involving sqrtf can be adjusted similarly:

void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
    // Compute center of touch.
    int32_t sum_x_w = 0, sum_y_w = 0, sum_w = 0;
    // touch.num_position_estimates is at most 100
    for (size_t i = 0; i < touch.num_position_estimates; i++) {
        sum_x_w += touch.position_estimates[i].position.x * 255;
        sum_y_w += touch.position_estimates[i].position.y * 255;
        sum_w += 255;
    }

    // Cast is safe, since average can't exceed any of the individual values.
    touch.center = Point{
        .x = static_cast<int16_t>(sum_x_w / sum_w),
        .y = static_cast<int16_t>(sum_y_w / sum_w),
    };

    // Compute radial deviation from center.
    float sum_d_squared_w = 0.0f;
    for (size_t i = 0; i < touch.num_position_estimates; i++) {
        int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
        int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
        sum_d_squared_w += static_cast<float>(dx * dx + dy * dy) * 255;
    }

    // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w))
    touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
        = sqrtf(sum_d_squared_w / sum_w);
}

We know beforehand the maximal values of dx and dy are +/-750, so sum_d_squared_w will require as much as 35 integral bits to represent the final sum. sum_w also needs 11 bits, so sqrtf would need to accept a value that can be held in ~24 bits. This can definitely fit into a signed long _Accum which has 32 integral bits, but ideally we’d like to use a _Sat signed _Accum for consistency. Similar to before, we can normalize the value by 255. That is, we can avoid multiplying by 255 when calculating sum_x_w, sum_y_w, and sum_w since this will result in the exact same touch.center results. Likewise, if we avoid the * 255 when calculating sum_d_squared_w, the final sqrtf result will be unchanged. This makes the code:

void TouchProcessor::CharacteriseRadialDeviation(Touch& touch) {
  // Compute center of touch.
  int32_t sum_x_w = 0, sum_y_w = 0, sum_w = touch.num_position_estimates;
  // touch.num_position_estimates is at most 100
  for (size_t i = 0; i < touch.num_position_estimates; i++) {
      sum_x_w += touch.position_estimates[i].position.x;
      sum_y_w += touch.position_estimates[i].position.y;
  }

  // Cast is safe, since average can't exceed any of the individual values.
  touch.center = Point{
      .x = static_cast<int16_t>(sum_x_w / sum_w),
      .y = static_cast<int16_t>(sum_y_w / sum_w),
  };

  // Compute radial deviation from center.
  float sum_d_squared_w = 0.0f;
  for (size_t i = 0; i < touch.num_position_estimates; i++) {
      int32_t dx = touch.position_estimates[i].position.x - touch.center.x;
      int32_t dy = touch.position_estimates[i].position.y - touch.center.y;
      sum_d_squared_w += static_cast<float>(dx * dx + dy * dy);
  }

  // deviation = sqrt(sum((dx^2 + dy^2) * w) / sum(w))
  touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
      = sqrtf(sum_d_squared_w / sum_w);
}

This allows sum_d_squared_w to fit within 31 integral bits. We can go further and realize sum_d_squared_w will always have an integral value and replace its type with a uint32_t. This will mean any fractional part from sum_d_squared_w / sum_w will be discarded, but for our use case, this is acceptable since the integral part of sum_d_squared_w / sum_w is so large that the fractional part is negligible. touch.features[i] also isn’t propagated significantly so we don’t need to worry about propagation of error. With this, we can replace the sqrtf with one that accepts an integral type and returns a fixed-point type:

touch.features[static_cast<size_t>(Touch::Feature::kRadialDeviation)]
    = isqrt(sum_d_squared_w / sum_w);

Results#

Classification runtime performance#

Before any of this effort, a single run of classifying sensor input took on average 856.8 us.

Currently, with all floats substituted with fixed-point types, a single run is 412.845673 us which is a ~2.1x speed improvement.

It’s worth noting that the optimizations we described above also improved performance when using floats, making a single run using floats 771.475464 us which translates to ~1.9x speed improvement.

Size#

We’ve observed ~1.25 KiB saved (out of a ~177 KiB binary) when switching to fixed-point.

$ bloaty app.elf.fixed -- app.elf.float
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.5% +3.06Ki  [ = ]       0    .debug_line
  +0.1% +1.89Ki  [ = ]       0    .debug_str
  +0.2%    +496  [ = ]       0    .debug_abbrev
  +0.6%    +272  +0.6%    +272    .bss
  -0.3%     -16  -0.3%     -16    .data
  -0.3%    -232  [ = ]       0    .debug_frame
  -1.3%    -712  [ = ]       0    .debug_ranges
  -0.3% -1.11Ki  [ = ]       0    .debug_loc
  -1.2% -1.50Ki  -1.2% -1.50Ki    .text
  -1.6% -1.61Ki  [ = ]       0    .symtab
  -3.2% -3.06Ki  [ = ]       0    .pw_tokenizer.entries
  -1.2% -4.00Ki  [ = ]       0    .strtab
  -0.4% -19.6Ki  [ = ]       0    .debug_info
  -0.3% -26.1Ki  -0.7% -1.25Ki    TOTAL

A large amount of this can be attributed to not needing many of the __aeabi_* float functions provided by libc. All instances of floats were removed so they don’t get linked into the final binary. Instead, we use the fixed-point equivalents provided by llvm-libc. Much of these fixed-point functions are also smaller and make fewer calls to other functions. For example, the sqrtf we used makes calls to __ieee754_sqrtf, __aeabi_fcmpun, __aeabi_fcmplt, __errno, and __aeabi_fdiv whereas isqrtf_fast only makes calls to __clzsi2, __aeabi_lmul, and __aeabi_uidiv. It’s worth noting that individual fixed-point binary operations are slightly larger since they involve a bunch of shifts or saturation checks. Floats tend to make a call to the same library function, but fixed-point types emit instructions at every callsite.

float_add:
  push    {r7, lr}
  add     r7, sp, #0
  bl      __aeabi_fadd
  pop     {r7, pc}

fixed_add:
  adds    r0, r0, r1
  bvc     .LBB1_2
  asrs    r1, r0, #31
  movs    r0, #1
  lsls    r0, r0, #31
  eors    r0, r1
.LBB1_2:
  bx      lr

Although the callsite is shorter, it’s worth noting that the soft float functions can’t be inlined because they tend to be very large. The __aeabi_fadd in particular is 444 bytes large.

Correctness#

We observe the exact same results for our classification algorithms when using fixed-point types. We have over 15000 classification tests with none of them producing differing results. This effectively means we get all the mentioned benefits without any correctness cost.

Using fixed-point#

Fixed-point arithmetic can be used out-of-the-box with Clang by adding -ffixed-point on the command line. All of the types and semantics described in ISO 18037 are supported. llvm-libc is the only libc that supports fixed-point printing and math functions at the moment. Not all libc functions described in ISO 18037 are supported at the moment, but they will be added eventually! Pigweed users can use these fixed-point functions by using the //pw_libc:stdfix target.

Future work#

An option we could take to reduce size (at the cost of even more performance) is to potentially teach LLVM to outline fixed-point additions. This could result in something similar to the float libcalls.
See if we can instead use a regular signed _Accum in the codebase instead of a _Sat signed _Accum to save some instructions (or even see if we can use a smaller type).
Investigate performance compared to hard floats. If we’re able to afford using an unsaturated type, then many native floating-point ops could be replaced with normal integral ops.
Full libc support for the fixed-point C functions.