Mar 17, 2014

Count your pluses

This text is a continuation of the “Bubblin' up” post regarding floating point operations, and more specifically here I will explore how to properly sum such numbers. The problem is that, because of the way floating point numbers are represented, every operation done on them is potentially imprecise. Additionally, even if the operations themselves carried no error at all, there is still a problem with storing the result: if a number can't be exactly represented in the chosen floating point format (as discussed in the previous post), a truncation error is inevitably introduced. What does that mean? Firstly, do not trust floating point numbers. Yes, the statement is a bit exaggerated, but it is good practice to always keep in mind that floating point numbers are just approximations. And secondly, special care should be taken when working with them.
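
As a quick illustration (a minimal sketch of my own, assuming IEEE 754 doubles, not taken from the previous post), the decimal value 0.1 has no exact binary representation, which is why the comparison below typically fails:

#include <cstdio>

int main()
{
    // 0.1 and 0.2 cannot be stored exactly in binary floating point,
    // so their sum differs from the (also inexact) constant 0.3
    double a = 0.1, b = 0.2;
    printf("%.17g\n", a + b);          // prints 0.30000000000000004 on typical IEEE 754 hardware
    printf("%d\n", (a + b) == 0.3);    // prints 0
    return 0;
}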

Truncation, as unseemly as it may appear, is inevitable because of the discreteness of computer memory. Nothing can be done about it; there is no computer with an infinite amount of memory. Here the machine epsilon arises as a natural measure of the granularity of a given floating point format: it is the gap between 1 and the next representable number, determined by the number of mantissa bits. Rounding errors are a consequence of truncation and occur because no number can be represented in a computer with an infinite number of significant digits, meaning all floating point operations are approximate and need to be rounded, hence the name. By design the machine epsilon is an upper bound on the relative error associated with rounding in floating point arithmetic.
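
A minimal sketch (again assuming IEEE 754 doubles and using only the standard <limits> header) shows the machine epsilon for double and the closely related effect that a term smaller than the accumulator's granularity simply vanishes when added:

#include <cstdio>
#include <limits>

int main()
{
    // Machine epsilon: the gap between 1.0 and the next representable double
    double eps = std::numeric_limits<double>::epsilon();
    printf("epsilon = %g\n", eps);              // about 2.22e-16 for IEEE 754 doubles

    // Adding something smaller than that gap (relative to 1.0) leaves the value unchanged
    printf("%d\n", (1.0 + eps / 2) == 1.0);     // prints 1
    return 0;
}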

All that said, summation is a classical problem illustrating the accumulation of error. Every addition carries some error into the result, and even if each individual error is small, the accumulation might be significant when adding many terms. Furthermore, because computers normalize the numbers (see Wikipedia), if the values' exponents differ greatly, the smaller numbers' contributions might not be reflected in the sum at all! Although the problem is well known, the solution is not as obvious or trivial as it might seem. So let's consider and compare a few approaches (error estimations will not be derived, only presented):

1. The naïve approach consists of just adding up the numbers in an accumulator. The accumulated error grows linearly in the number of terms, O(n). A snippet in C++ illustrating the method is presented:

double doNaiveSum(double * numbers, int size)
{
    double sum = 0;
    for (int i = 0; i < size; i++)
        sum += numbers[i];
    return sum;
}

2. The naïve approach with presorting, in which the numbers are sorted by absolute value before summing them. The idea is to sum the smaller numbers first so their contributions are not lost. The accuracy is often only marginally better than in the former case, and the benefit is not guaranteed; it depends heavily on the magnitudes of the numbers, which might vary widely between applications. The summation itself is identical to the previous case, so only the sorting step differs, as sketched below.
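
A possible arrangement of this, assuming C++11 and sorting a local copy so the caller's array is left untouched (the function name doPresortedSum is made up to match the style of the other snippets), could be:

#include <algorithm>
#include <cmath>
#include <vector>

double doPresortedSum(double * numbers, int size)
{
    // Copy the input so the original order is preserved for the caller
    std::vector<double> sorted(numbers, numbers + size);

    // Sort by absolute value, smallest magnitude first
    std::sort(sorted.begin(), sorted.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });

    // Sum exactly as in the naive approach
    double sum = 0;
    for (int i = 0; i < size; i++)
        sum += sorted[i];
    return sum;
}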

3. Pairwise summation is a divide-and-conquer algorithm that splits the range in half, recursively sums each half and then adds the two partial results. The method offers a significant improvement over the naïve approach: the error grows only logarithmically, O(log n), for a minute increase in the number of arithmetic operations. A minor inconvenience is that the natural implementation is recursive. Consider the following example code:

double doPairwiseSum(double * numbers, int size)
{
    if (size < 4)  {  // For 3 numbers or less just perform the regular summation
        double sum = 0;
        for (int i = 0; i < size; i++)
            sum += numbers[i];
        return sum;
    }

    // Calculate the left part size as half the original size, and the right part size as the remainder
    int leftSize = (size >> 1), rightSize = size - leftSize;

    // Perform a recursive sum on each part and return the result
    return doPairwiseSum(numbers, leftSize) + doPairwiseSum(numbers + leftSize, rightSize);
}

4. Kahan summation is a compensated summation technique which uses a running compensation variable to reduce the numerical error. The idea is to keep an additional variable where the low-order bits, which would otherwise be lost, can be stored. Although Kahan's algorithm achieves O(1) error growth (a bound that does not depend on the number of additions), it requires extra operations per term, which might not always be acceptable. In such a case pairwise summation is a viable alternative with its logarithmic error growth.

double doCompensatedSum(double * numbers, int size)
{  
    double sum = 0, compensation = 0;
    for (int i = 0; i < size; i++)  {
        // Adjust the input value with the stored compensation value (the low-order bits from previous operations)
        double x = numbers[i] - compensation;
        // Calculate the new value for the sum
        double tempSum = sum + x;
        // Recover the low-order bits lost in the last addition
        compensation = (tempSum - sum) - x;
        // Update the sum variable
        sum = tempSum;
    }
    return sum;
}

When using this algorithm, beware of overly aggressive compiler optimizations: in exact arithmetic the compensation is always zero, so a compiler that is allowed to reassociate floating point operations (for example under "fast math" style flags) could in principle optimize the compensation away entirely.
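
One possible defence, sketched below (the volatile qualifiers and the function name are my own additions, not part of the original algorithm), is to force the intermediate values to be materialized so the compensation term cannot be algebraically eliminated; simply not enabling such flags for this code is of course the cleaner option:

double doCompensatedSumGuarded(double * numbers, int size)
{
    double sum = 0, compensation = 0;
    for (int i = 0; i < size; i++)  {
        // volatile forces the compiler to store and reload each intermediate value,
        // preventing it from concluding that (tempSum - sum) - x is always zero
        volatile double x = numbers[i] - compensation;
        volatile double tempSum = sum + x;
        compensation = (tempSum - sum) - x;
        sum = tempSum;
    }
    return sum;
}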

5. Arbitrary precision arithmetic is a brute-force solution in which no special care is taken to limit the error growth; the error is simply pushed down by using more digits. For most intents and purposes one of the above (or similar) algorithms would be more appropriate, since arbitrary precision floating point numbers are usually emulated in software (at least to my knowledge) and carry a lot of computational and memory overhead.
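
For completeness, a minimal sketch of how this could look, assuming the GNU MPFR library is available (the 256-bit working precision is an arbitrary choice for illustration):

#include <mpfr.h>

double doArbitraryPrecisionSum(double * numbers, int size)
{
    // Accumulate in a 256-bit floating point number, then round back to double at the end
    mpfr_t sum;
    mpfr_init2(sum, 256);
    mpfr_set_d(sum, 0.0, MPFR_RNDN);

    for (int i = 0; i < size; i++)
        mpfr_add_d(sum, sum, numbers[i], MPFR_RNDN);

    double result = mpfr_get_d(sum, MPFR_RNDN);
    mpfr_clear(sum);
    return result;
}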