As in any other programming language, computations in C++ are performed using the available operators, the intrinsic functions implemented by the compiler, and the core mathematical functions provided by the standard C and C++ libraries or by some alternative implementation.

The header HCCMath.h provides alternatives to many of the core mathematical functions specified in the C and C++ standards. All of the functions can be evaluated at compile time (constexpr), and several of them also offer runtime performance benefits.

The functions are implemented in the Harlinn::Common::Core::Math namespace.

Unit Tests

Extensive unit tests, available here, strive to demonstrate the accuracy of the computations.

Benchmarks

The performance of the functions is benchmarked using the Google benchmark library, and can be verified by building and executing BasicMathBenchmarks included in the Harlinn.Windows solution.

Benchmarks of a single inline function cannot be relied upon to accurately determine how well the function will perform in a real application. For release builds, the compiler and linker employ global optimization strategies, attempting to optimize operations across all compilation units. These strategies often find optimization opportunities that are very hard to detect and implement manually, so the only way to really determine whether one set of functions performs better than another is to try them out in a real, computationally intensive application.

It is, however, unlikely that a set of functions that performs worse than another, in a reasonable set of benchmarks, can outperform the other in a real application.

PBRTO: a micro-optimized ray tracing app

PBRTO is a micro-optimized version of PBRT-v4, under development as an example of how the functionality in HCCMath.h, HCCSIMD.h and HCCVectorMath.h can be used to optimize the performance of real, computationally intensive apps. It is now about 35 % faster than the release build of the original PBRT.

Background

The functions were created to enable constexpr evaluation of mathematical expressions, since nothing improves runtime performance as much as having the compiler calculate the results at compile time.

Much of the code is based on version 0.8.5 of the OpenLibm mathematical C library used by the Julia programming language.

The library does not include the OpenLibm floating point environment, and relies on the floating point environment provided by the Visual C++ runtime.

Some functions, like Sin, support constexpr evaluation for only a subset of the possible arguments. Sin has no problems with constexpr evaluation for arguments within \(\pm 20000^\circ\), but constexpr evaluation of Sin(1.7976931348623158e+308), the largest finite double, fails.

Implementation quality

Since it is based on OpenLibm, the implementation is of very high quality. OpenLibm does many things very well, but sometimes the Visual C++ runtime, an intrinsic function, or another alternative implemented by the library performs better. When this is the case, the library selects the implementation with the best runtime performance.

Testing

Rather extensive changes to the OpenLibm code were required to enable constexpr evaluation, and there are currently 535 unit tests helping to ensure the quality of the mathematical parts of the library.

Several of the tests execute the function under test 20 000 times using randomly generated values, while others try every possible value over the range of values most often used with the function.

The functions in the Math namespace that use the constexpr path implementation at runtime are thoroughly tested, and the tests try to determine the maximum deviation between the standard function and the corresponding function in the Math namespace.

Deviation is calculated by the Deviation function below.

The value passed for the first argument is the expected result, usually calculated using the standard implementation, while the value calculated by the corresponding function in the Math namespace is passed as the second argument.

#include <cmath>
#include <limits>

// Returns the relative deviation of second (the value under test)
// from first (the expected value)
inline double Deviation( double first, double second )
{
    // If both are NaN, the results don't deviate
    if ( std::isnan( first ) )
    {
        if ( std::isnan( second ) )
        {
            return 0.0;
        }
        return std::numeric_limits<double>::infinity( );
    }
    else if ( std::isnan( second ) )
    {
        // The second value is NaN, but not the first
        return std::numeric_limits<double>::infinity( );
    }
    if ( std::isinf( first ) )
    {
        if ( std::isinf( second ) )
        {
            if ( first > 0. && second > 0. )
            {
                // Both values are +infinity
                return 0.0;
            }
            else if ( first < 0. && second < 0. )
            {
                // Both values are -infinity
                return 0.0;
            }
            // Opposite signs
            return std::numeric_limits<double>::infinity( );
        }
        // Only the first value is infinite
        return std::numeric_limits<double>::infinity( );
    }
    else if ( std::isinf( second ) )
    {
        // Only the second value is infinite
        return std::numeric_limits<double>::infinity( );
    }

    // Avoid division by zero
    if ( first != 0.0 )
    {
        // |second - first| equals |first - second|, so the order of the
        // operands does not matter
        return std::abs( second - first ) / std::abs( first );
    }
    else
    {
        // When second is very close to zero, the result is zero deviation
        constexpr double veryCloseToZero = 5e-323;
        const double absSecond = std::abs( second );
        if ( absSecond <= veryCloseToZero )
        {
            return 0.0;
        }
        // May still be very close to zero, but will cause the test to fail.
        return 1.0;
    }
}
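The randomized tests described above can be approximated by a loop like the following. It is a sketch under stated assumptions: MaxRelativeDeviation and DeviationResult are hypothetical names, the Mersenne Twister with a fixed seed is just one reasonable choice of generator, and the deviation is computed in a simplified form that only handles finite results.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>

// Result of a randomized accuracy test.
struct DeviationResult
{
    double maxDeviation; // largest relative deviation found
    double argument;     // argument that produced it
};

// Hypothetical helper: evaluates expected and actual at count uniformly
// distributed arguments in [low, high) and records the largest deviation.
// For brevity it only handles finite results; the Deviation function above
// also covers NaN, infinity and zero.
template<typename Expected, typename Actual>
DeviationResult MaxRelativeDeviation( Expected expected, Actual actual,
                                      double low, double high,
                                      std::size_t count, unsigned seed = 1 )
{
    std::mt19937_64 engine( seed ); // fixed seed: reproducible runs
    std::uniform_real_distribution<double> distribution( low, high );
    DeviationResult result{ 0.0, low };
    for ( std::size_t i = 0; i < count; ++i )
    {
        const double x = distribution( engine );
        const double e = expected( x );
        const double a = actual( x );
        const double deviation = e != 0.0 ? std::abs( a - e ) / std::abs( e )
                                          : std::abs( a );
        if ( deviation > result.maxDeviation )
        {
            result = { deviation, x };
        }
    }
    return result;
}
```

Comparing std::exp with itself yields a maximum deviation of 0.0, while comparing it with a candidate such as []( double x ) { return std::exp2( x * 1.4426950408889634 ); } over [-9, 10) reports the worst relative deviation together with the argument that produced it.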

Exceptional performance

A few functions outperform the standard implementation spectacularly; Exp, for example, outperforms std::exp by about 1200 %.

The two implementations return the same result for 2261694913 out of 2288746510 cases. When tested with double precision floating point argument values uniformly distributed over the interval [-744.0, 710.0], the maximum deviation, 1.56426946755e-12, was obtained when passing -717.256469727 as the argument to the functions.

Using SIMD::Traits<T, N>

Some functions, like Hypot, use the SIMD::Traits<T, N> specializations to achieve excellent runtime performance.

template<typename T>
    requires IsFloatingPoint<T>
constexpr inline std::remove_cvref_t<T> Hypot( T x, T y, T z ) noexcept
{
    if ( std::is_constant_evaluated( ) )
    {
        // Compile time: OpenLibm based implementation, using all three arguments
        return Math::Internal::OpenLibM::FastHypot( x, y, z );
    }
    else
    {
        using FloatT = std::remove_cvref_t<T>;
        using Traits = SIMD::Traits<FloatT, 3>;

        // Load the three values, square them, add them horizontally,
        // and take the square root of the sum
        auto v = Traits::Set( z, y, x );
        v = Traits::Mul( v, v );
        v = Traits::HSum( v );
        v = Traits::Sqrt( v );
        return Traits::First( v );
    }
}
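For readers unfamiliar with SIMD::Traits, the runtime branch above amounts to a handful of SSE operations. The following sketch writes them out with raw intrinsics for single precision; Hypot3Sse is a hypothetical name, the mapping from the Traits calls to these intrinsics is an assumption rather than the library's actual code, and an x86-64 target is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical sketch of a three-value hypot for float using SSE.
inline float Hypot3Sse( float x, float y, float z ) noexcept
{
    // Lanes, from lowest to highest: x, y, z, 0
    __m128 v = _mm_set_ps( 0.0f, z, y, x );
    v = _mm_mul_ps( v, v ); // square every lane

    // Horizontal sum of the four lanes
    __m128 shuffled = _mm_shuffle_ps( v, v, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    __m128 sums = _mm_add_ps( v, shuffled );    // (x^2+y^2, x^2+y^2, z^2, z^2)
    shuffled = _mm_movehl_ps( shuffled, sums ); // moves z^2 into the lowest lane
    sums = _mm_add_ss( sums, shuffled );        // lowest lane: x^2+y^2+z^2

    // Square root of the lowest lane, returned as a scalar
    return _mm_cvtss_f32( _mm_sqrt_ss( sums ) );
}
```

Unlike std::hypot, which is required to avoid undue overflow and underflow in intermediate results, this sketch squares its arguments directly, so it overflows to infinity for arguments larger than about 1.8e19.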

Using both the standard and the internal implementation at runtime

For double precision floating point values, Math::Internal::TanImpl performs about 40 % worse than std::tan, but for single precision floating point values it beats std::tan by more than 60 %. Splitting the runtime execution path, using the internal implementation for single precision values and std::tan for double precision values, provides the best solution:

template<typename T>
    requires IsFloatingPoint<T>
constexpr inline std::remove_cvref_t<T> Tan( T x ) noexcept
{
    using FloatT = std::remove_cvref_t<T>;
    if ( std::is_constant_evaluated( ) )
    {
        if constexpr ( std::is_same_v<FloatT, float> )
        {
            return Math::Internal::OpenLibM::tanf( x );
        }
        else
        {
            return Math::Internal::OpenLibM::tan( x );
        }
    }
    else
    {
        if constexpr ( std::is_same_v<FloatT, float> )
        {
            return Math::Internal::OpenLibM::tanf( x );
        }
        else
        {
            return std::tan( x );
        }
    }
}

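The BasicMathBenchmarks results at the bottom of this page show how the performance numbers were calculated: the improvement is ((CPU time of the standard function - CPU time of the function in the Math namespace) / CPU time of the function in the Math namespace) × 100, so an improvement of 100 % means that the function needs half the CPU time of the standard one. The runtime execution path for each function is selected, as shown above, based on its performance in the benchmarks.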
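The Improvement column can be reproduced from the two CPU times. A minimal sketch; ImprovementPercent and SpeedRatio are hypothetical helper names, and the formula is the one written out in each row of the results table.

```cpp
#include <cassert>
#include <cmath>

// Improvement in percent, as computed in the results table:
// ((standard CPU time - function CPU time) / function CPU time) * 100
constexpr double ImprovementPercent( double standardCpuNs, double functionCpuNs ) noexcept
{
    return ( ( standardCpuNs - functionCpuNs ) / functionCpuNs ) * 100.0;
}

// The same measurement as a speed ratio: an improvement of 1200 %
// corresponds to the standard function taking 13 times as long.
constexpr double SpeedRatio( double standardCpuNs, double functionCpuNs ) noexcept
{
    return standardCpuNs / functionCpuNs;
}
```

For Exp, ImprovementPercent( 47.4, 3.52 ) gives about 1246.59, matching the table, and SpeedRatio( 47.4, 3.52 ) gives about 13.5, meaning std::exp takes roughly 13.5 times as long. A negative value, as for BenchmarkDoubleMin, means the function was slightly slower in that run.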

Basic operations

  • Abs returns the absolute value \(|x|\) of x, and is implemented for floating point types, signed integers and unsigned integers. It calls std::abs at runtime.

  • FMod calculates the remainder of a floating point division operation, and is implemented for floating point types.

    In the benchmark results at the bottom of this page, FMod outperforms std::fmod by approximately 12 % for double precision floating point values, and by approximately 5 % for single precision floating point values.

  • Max returns the greater of two values, and is implemented for floating point types.

    Max calls std::max at compile time. At runtime it calls _mm_max_ss for single precision floating point values, and std::max for double precision floating point values. This improves performance for single precision floating point values by 10 % on average, although the improvement varies between 2 % and 30 % from one benchmark run to the next.

  • Min returns the lesser of two values, and is implemented for floating point types.

    Min calls std::min at compile time. At runtime it calls _mm_min_ss for single precision floating point values, and std::min for double precision floating point values. This improves performance for single precision floating point values by 7 % on average, although the result varies between -2 % and 20 % from one benchmark run to the next.

  • IsSameValue checks for binary equality between two floating point values.

Exponential functions

  • Exp returns e raised to the given power, \(e^x\).

    Exp outperforms std::exp by approximately 1200 % for double precision floating point values, and by approximately 1000 % for single precision floating point values.

    The maximum detected deviation between std::exp and Exp is 1.18844e-07 for single precision floating point values, and 2.18599e-16 for double precision floating point values, for argument values in the range -9 to 10, tested with a uniform random distribution of 10'000 values.

  • Log computes the natural (base e) logarithm, \(\ln x\).

    Log outperforms std::log by approximately 260 % for double precision floating point values, and by approximately 400 % for single precision floating point values.

    The maximum detected deviation between std::log and Log is 1.18795e-07 for single precision floating point values, and 2.04848e-16 for double precision floating point values, for argument values in the range 0 to 100000, tested with a uniform random distribution of 10'000 values.

  • Log2 computes the base 2 logarithm of the given number, \(\log_2 x\).

    Log2 outperforms std::log2 by approximately 270 % for double precision floating point values, and by approximately 250 % for single precision floating point values.

    The maximum detected deviation between std::log2 and Log2 is 1.18288e-07 for single precision floating point values, and 2.1882e-16 for double precision floating point values, for argument values in the range 0 to 100000, tested with a uniform random distribution of 10'000 values.

  • Log10 computes the common (base 10) logarithm, \(\log_{10} x\).

    Log10 outperforms std::log10 by approximately 360 % for double precision floating point values, and by approximately 390 % for single precision floating point values.

    The maximum detected deviation between std::log10 and Log10 is 1.18216e-07 for single precision floating point values, and 2.0154e-16 for double precision floating point values, for argument values in the range 0 to 100000, tested with a uniform random distribution of 10'000 values.
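Fast implementations of exp typically rely on argument reduction. The following simplified constexpr sketch illustrates the general technique used by fdlibm-derived libraries such as OpenLibm: write x = k·ln 2 + r with |r| ≤ ln 2 / 2, approximate e^r, and scale by 2^k. ExpSketch is a hypothetical name; the degree-13 Taylor polynomial stands in for the fitted approximation a production implementation uses, and there is no handling of overflow, underflow, NaN or infinity, so the sketch is only meant for roughly |x| ≤ 700.

```cpp
#include <cassert>
#include <cmath>

// Sketch of exp via argument reduction: x = k*ln2 + r, exp(x) = 2^k * exp(r).
// ln2 is split into a high and a low part (constants from fdlibm's e_exp.c)
// so that k * ln2Hi is exact and r keeps nearly full precision.
constexpr double ExpSketch( double x ) noexcept
{
    constexpr double invLn2 = 1.44269504088896338700e+00;
    constexpr double ln2Hi = 6.93147180369123816490e-01;
    constexpr double ln2Lo = 1.90821492927058770002e-10;

    // k = x / ln2 rounded to the nearest integer, so |r| <= ln2 / 2
    const int k = static_cast<int>( x * invLn2 + ( x < 0.0 ? -0.5 : 0.5 ) );
    const double r = ( x - k * ln2Hi ) - k * ln2Lo;

    // Taylor polynomial for exp(r) up to r^13 / 13!, in Horner form
    double p = 1.0;
    for ( int n = 13; n >= 1; --n )
    {
        p = 1.0 + p * r / n;
    }

    // Scale by 2^k one exact step at a time (constexpr friendly)
    for ( int i = 0; i < k; ++i )
    {
        p *= 2.0;
    }
    for ( int i = 0; i > k; --i )
    {
        p *= 0.5;
    }
    return p;
}

static_assert( ExpSketch( 0.0 ) == 1.0 );
static_assert( ExpSketch( 1.0 ) > 2.718281828 && ExpSketch( 1.0 ) < 2.718281829 );
```

Because the polynomial only has to be accurate on a small interval around zero, a few multiply-adds are enough; the scaling by 2^k is exact as long as the result stays in the normal range.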

Power functions

  • Sqrt computes the square root (√x).

    Sqrt calls _mm_sqrt_pd or _mm_sqrt_ps at runtime.

    Sqrt outperforms std::sqrt by approximately 1400 % for double precision floating point values, and by approximately 1300 % for single precision floating point values.

  • Hypot computes the square root of the sum of the squares of two or three numbers.

    The two argument version of Hypot outperforms std::hypot by approximately 270 % for double precision floating point values, and by approximately 230 % for single precision floating point values.

    The three argument version of Hypot outperforms std::hypot by approximately 190 % for double precision floating point values, and by approximately 320 % for single precision floating point values.
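Sqrt is described as calling _mm_sqrt_pd or _mm_sqrt_ps at runtime. The sketch below shows one way a single scalar can be passed through those packed intrinsics, assuming an x86-64 target with SSE2; SqrtSse is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <emmintrin.h> // SSE2 intrinsics (also includes the SSE ones)

// Hypothetical sketch: square root of one double with the packed SSE2
// instruction; only the lowest lane carries a value.
inline double SqrtSse( double x ) noexcept
{
    return _mm_cvtsd_f64( _mm_sqrt_pd( _mm_set_sd( x ) ) );
}

// The same for float, using the packed SSE instruction.
inline float SqrtSse( float x ) noexcept
{
    return _mm_cvtss_f32( _mm_sqrt_ps( _mm_set_ss( x ) ) );
}
```

IEEE 754 requires square root to be correctly rounded, so for finite non-negative arguments the sketch and std::sqrt should return bit-identical results; one difference is that std::sqrt may set errno for negative arguments, depending on math_errhandling.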

Trigonometric functions

Graphics-intensive applications are highly sensitive to the performance of the trigonometric functions, especially for single precision floating point values.

  • Sin computes the sine of its argument given in radians.

    Sin calls std::sin at runtime for both single and double precision values.

    The constexpr path for Sin outperforms std::sin by approximately 100 % for single precision floating point values, but performs worse when both the sine and the cosine are calculated for the same value.

    The maximum deviation between std::sin and the constexpr path for Sin is 1.19182e-07 for single precision floating point values, and 2.22045e-16 for double precision floating point values, tested for all possible single precision floating point argument values in the range -((2*pi)+epsilon) to ((2*pi)+epsilon).

  • Cos computes the cosine of its argument given in radians.

    Cos calls std::cos at runtime for both single and double precision values.

    The constexpr path for Cos outperforms std::cos by approximately 110 % for single precision floating point values, but performs worse when both the sine and the cosine are calculated for the same value.

    The maximum deviation between std::cos and the constexpr path for Cos is 1.19187e-07 for single precision floating point values, and 2.22044e-16 for double precision floating point values, tested for all possible single precision floating point argument values in the range -((2*pi)+epsilon) to ((2*pi)+epsilon).

  • Tan computes the tangent of its argument given in radians.

    Tan outperforms std::tan by approximately 60 % for single precision floating point values.

    For double precision floating point values, Tan calls std::tan at runtime, with a consistent performance penalty of about 20 % compared to calling std::tan directly.

    The maximum deviation between std::tan and the constexpr path for Tan is 1.19209e-07 for single precision floating point values, and 2.22045e-16 for double precision floating point values, tested for all possible single precision floating point argument values in the range -((2*pi)+epsilon) to ((2*pi)+epsilon).

  • ASin computes the arc sine of its argument.

    ASin outperforms std::asin by approximately 20 % for double precision floating point values, and by approximately 30 % for single precision floating point values.

    The maximum deviation between std::asin and ASin is 2.27673e-07 for single precision floating point values, and 2.22044e-16 for double precision floating point values, tested for all possible single precision floating point argument values in the range -1.0 to 1.0.

  • ACos computes the arc cosine of its argument.

    ACos outperforms std::acos by approximately 30 % for double precision floating point values, and by approximately 50 % for single precision floating point values.

    The maximum deviation between std::acos and ACos is 1.19209e-07 for single precision floating point values, and 2.22044e-16 for double precision floating point values, tested for all possible single precision floating point argument values in the range -1.0 to 1.0.

  • ATan computes the arc tangent of its argument.

    ATan outperforms std::atan by approximately 5 % for double precision floating point values, and by approximately 30 % for single precision floating point values.

    The maximum detected deviation between std::atan and ATan is 0.0 for single precision floating point values, and 0.0 for double precision floating point values, tested with a uniform random distribution of 10'000 values in the range -10'000 to 10'000.

  • ATan2, like std::atan2( y, x ), computes the arc tangent of y / x, using the signs of both arguments to determine the quadrant.

    ATan2 outperforms std::atan2 by approximately 20 % for double precision floating point values, and by approximately 40 % for single precision floating point values.

    The maximum detected deviation between std::atan2 and ATan2 is 0.0 for single precision floating point values, and 0.0 for double precision floating point values, tested with a uniform random distribution of 10'000 values in the range -10'000 to 10'000.
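The accuracy figures for Sin, Cos, Tan, ASin and ACos come from testing every single precision value in a range. A sketch of such an exhaustive test, assuming std::sin evaluated in double precision as the reference; ExhaustiveMaxDeviation and ExhaustiveResult are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

struct ExhaustiveResult
{
    double maxDeviation;        // largest relative deviation found
    std::uint64_t valuesTested; // number of float arguments visited
};

// Hypothetical helper: visits every float in [low, high] (both finite,
// low <= high) and compares reference( x ) with candidate( x ).
template<typename Reference, typename Candidate>
ExhaustiveResult ExhaustiveMaxDeviation( Reference reference, Candidate candidate,
                                         float low, float high )
{
    ExhaustiveResult result{ 0.0, 0 };
    float x = low;
    while ( true )
    {
        const double expected = reference( x );
        const double actual = candidate( x );
        const double deviation = expected != 0.0
            ? std::abs( actual - expected ) / std::abs( expected )
            : std::abs( actual );
        if ( deviation > result.maxDeviation )
        {
            result.maxDeviation = deviation;
        }
        ++result.valuesTested;
        if ( x == high )
        {
            break;
        }
        // Step to the next representable float
        x = std::nextafter( x, std::numeric_limits<float>::infinity( ) );
    }
    return result;
}
```

The interval [1, 2] is a convenient choice for a quick run, since it contains 2^23 + 1 floats; the range ±(2π + ε) used above contains roughly two billion floats.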

Nearest integral value floating point operations

  • Ceil computes the nearest integral value not less than the given value.

    Ceil calls __ceil or __ceilf at runtime.

  • Floor computes the nearest integral value not greater than the given value.

    Floor calls __floor or __floorf at runtime.

  • Trunc computes the nearest integral value not greater in magnitude than the given value.

    Trunc calls __truncf at runtime for single precision floating point numbers, and _mm_round_pd for double precision floating point numbers, improving performance for double precision values by approximately 320 %.

  • Round computes the nearest integral value, rounding away from zero in halfway cases.

    Round calls __roundf at runtime for single precision floating point numbers, and _mm_round_pd for double precision floating point numbers, improving performance for double precision values by approximately 590 %.
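The definitions above can be expressed in portable constexpr code by converting to a 64-bit integer, which truncates towards zero. This is a sketch for illustration only; TruncSketch, FloorSketch, CeilSketch and RoundSketch are hypothetical names, and the library itself uses the runtime functions and intrinsics listed above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical constexpr sketches of the rounding definitions above.
constexpr double TruncSketch( double x ) noexcept
{
    constexpr double twoPow52 = 4503599627370496.0; // from 2^52 on, every double is integral
    const double magnitude = x < 0.0 ? -x : x;
    if ( !( magnitude < twoPow52 ) || x == 0.0 )
    {
        return x; // already integral (including +-0), infinite, or NaN
    }
    const double t = static_cast<double>( static_cast<std::int64_t>( x ) );
    return ( t == 0.0 && x < 0.0 ) ? -0.0 : t; // keep the sign for (-1, 0)
}

constexpr double FloorSketch( double x ) noexcept
{
    const double t = TruncSketch( x );
    return t > x ? t - 1.0 : t; // only negative non-integers move further down
}

constexpr double CeilSketch( double x ) noexcept
{
    const double t = TruncSketch( x );
    return t < x ? t + 1.0 : t; // only positive non-integers move further up
}

// Rounds halfway cases away from zero, like std::round.
constexpr double RoundSketch( double x ) noexcept
{
    const double t = TruncSketch( x );
    const double fraction = x - t; // exact for every finite double
    if ( fraction >= 0.5 )
    {
        return t + 1.0;
    }
    if ( fraction <= -0.5 )
    {
        return t - 1.0;
    }
    return t;
}

static_assert( RoundSketch( 2.5 ) == 3.0 && RoundSketch( -2.5 ) == -3.0 );
static_assert( FloorSketch( -0.5 ) == -1.0 && CeilSketch( -0.5 ) == 0.0 );
static_assert( RoundSketch( 0.49999999999999994 ) == 0.0 );
```

The last static_assert shows why floor( x + 0.5 ) is not a correct implementation of Round: in double precision, 0.49999999999999994 + 0.5 rounds up to 1.0.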

Floating point manipulation functions

  • FRExp decomposes a number into significand and base-2 exponent.

    FRExp outperforms std::frexp by approximately 450 % for double precision floating point values, and by approximately 550 % for single precision floating point values.

  • ModF decomposes a number into integer and fractional parts.

    ModF outperforms std::modf by approximately 60 % for double precision floating point values, and by approximately 50 % for single precision floating point values.

  • ScaleByN multiplies a number by FLT_RADIX raised to a power.

    ScaleByN outperforms std::scalbn by approximately 90 % for double precision floating point values, and by approximately 170 % for single precision floating point values.

  • NextAfter returns the next representable floating point value towards the given value.

    NextAfter outperforms std::nextafter by approximately 40 % for double precision floating point values, and by approximately 90 % for single precision floating point values.

  • NextUp returns the smallest floating point number y of the same type as x such that x < y. If no such y exists, for example if x is +Inf or NaN, it returns x.

    The equivalent standard C++ call is std::nextafter( x, std::numeric_limits<double>::infinity( ) ).

    NextUp outperforms std::nextafter by approximately 280 % for double precision floating point values, and by approximately 400 % for single precision floating point values.

  • NextDown returns the largest floating point number y of the same type as x such that y < x. If no such y exists, for example if x is -Inf or NaN, it returns x.

    The equivalent standard C++ call is std::nextafter( x, -std::numeric_limits<double>::infinity( ) ).

    NextDown outperforms std::nextafter by approximately 210 % for double precision floating point values, and by approximately 330 % for single precision floating point values.

  • CopySign returns a value with the magnitude of its first argument and the sign of its second argument.

    CopySign outperforms std::copysign by approximately 300 % for double precision floating point values, and by approximately 10 % for single precision floating point values.

Classification and comparison

  • IsNaN checks if the given number is NaN.

    IsNaN calls std::isnan at runtime.

  • IsInf checks if the given number is infinite.

    IsInf calls std::isinf for double precision floating point values, and outperforms std::isinf for single precision floating point values by 40 %.

  • SignBit checks if the given number is negative.

    SignBit outperforms std::signbit by approximately 50 % for both double and single precision floating point values.

Other computations

  • Clamp returns the value if it is within [minimumValue, maximumValue]; otherwise it returns the nearest boundary.

    Clamp calls std::clamp at runtime.

  • Lerp computes the linear interpolation between a and b if the parameter t is inside [0, 1), and the linear extrapolation otherwise; that is, the result of a + t * ( b - a ), accounting for floating point calculation imprecision.

    Lerp calls std::lerp at runtime.

BasicMathBenchmarks Results

-------------------------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations  Improvement
-------------------------------------------------------------------------------------------
BenchmarkDoubleGenerator             2.37 ns         1.51 ns    640000000
BenchmarkFloatGenerator              1.80 ns         1.15 ns    640000000
BenchmarkDoubleIsSameValue           2.17 ns         1.57 ns    448000000
BenchmarkFloatIsSameValue            2.69 ns         1.95 ns    344615385
BenchmarkDoubleIsZero                2.64 ns         1.60 ns    497777778
BenchmarkFloatIsZero                 1.90 ns         1.15 ns    746666667
BenchmarkDoubleIsNaN                 2.63 ns         1.67 ns    373333333
BenchmarkDoubleStdIsNaN              1.70 ns         1.22 ns    640000000
BenchmarkFloatIsNaN                  2.36 ns         1.57 ns    497777778
BenchmarkFloatStdIsNaN               1.89 ns         1.05 ns    746666667
BenchmarkDoubleSignum                2.13 ns         1.55 ns    373333333
BenchmarkDoubleNaiveSignum           1.97 ns         1.60 ns    448000000
BenchmarkFloatSignum                 2.28 ns         2.01 ns    373333333
BenchmarkFloatNaiveSignum            2.05 ns         1.35 ns    497777778
BenchmarkDoubleDeg2Rad               1.72 ns         1.28 ns    560000000
BenchmarkFloatDeg2Rad                1.46 ns         1.15 ns    640000000
BenchmarkDoubleRad2Deg               1.67 ns         1.12 ns    640000000
BenchmarkFloatRad2Deg                1.51 ns         1.17 ns    746666667
BenchmarkDoubleNextAfter             6.40 ns         5.47 ns    100000000  (( 7.67 - 5.47)/5.47)*100 = 40.22 %
BenchmarkDoubleStdNextAfter          10.2 ns         7.67 ns     89600000
BenchmarkFloatNextAfter              4.42 ns         3.66 ns    179200000  (( 6.98 - 3.66)/3.66)*100 = 90.71 %
BenchmarkFloatStdNextAfter           10.9 ns         6.98 ns     89600000
BenchmarkDoubleInternalSqrt          35.2 ns         24.9 ns     21333333
BenchmarkDoubleSqrt                  1.85 ns         1.33 ns    448000000  (( 20.5 - 1.33)/1.33)*100 = 1441.35 %
BenchmarkDoubleStdSqrt               25.8 ns         20.5 ns     37333333
BenchmarkFloatInternalSqrt           13.3 ns         10.2 ns    100000000
BenchmarkFloatSqrt                   1.66 ns         1.26 ns    560000000  (( 18.0 - 1.26)/1.26)*100 = 1328.57 %
BenchmarkFloatStdSqrt                23.3 ns         18.0 ns     37333333
BenchmarkDoubleNextDown              2.56 ns         1.97 ns    373333333  (( 6.28 - 1.97)/1.97)*100 = 218.78 %
BenchmarkDoubleStdNextDown           10.6 ns         6.28 ns     89600000
BenchmarkFloatNextDown               2.39 ns         1.81 ns    560000000  (( 7.81 - 1.81)/1.81)*100 = 331.49 %
BenchmarkFloatStdNextDown            10.5 ns         7.81 ns    112000000
BenchmarkDoubleNextUp                2.95 ns         1.88 ns    373333333  (( 7.15 - 1.88)/1.88)*100 = 280.32 %
BenchmarkDoubleStdNextUp             9.68 ns         7.15 ns     89600000
BenchmarkFloatNextUp                 2.67 ns         1.90 ns    560000000  (( 9.42 - 1.90)/1.90)*100 = 395.79 %
BenchmarkFloatStdNextUp              12.6 ns         9.42 ns     89600000
BenchmarkDoubleIsInf                 1.78 ns         1.36 ns    448000000  (( 1.40 - 1.36)/1.36)*100 = 2.94 %
BenchmarkDoubleStdIsInf              2.00 ns         1.40 ns    448000000
BenchmarkFloatIsInf                  1.43 ns         1.03 ns   1000000000  (( 1.48 - 1.03)/1.03)*100 = 43.69 %
BenchmarkFloatStdIsInf               1.88 ns         1.48 ns    896000000
BenchmarkDoubleInternalAbs           1.99 ns         1.51 ns    497777778
BenchmarkDoubleAbs                   1.65 ns         1.20 ns    560000000  (( 1.29 - 1.20)/1.20)*100 = 7.5 %
BenchmarkDoubleStdAbs                1.73 ns         1.29 ns    448000000
BenchmarkFloatInternalAbs            2.21 ns         1.19 ns    448000000  
BenchmarkFloatAbs                    1.62 ns         1.22 ns    497777778  (( 1.26 - 1.22)/1.22)*100 = 3.28 %
BenchmarkFloatStdAbs                 1.69 ns         1.26 ns    560000000
BenchmarkDoubleSignBit               1.72 ns         1.29 ns    896000000  (( 1.94 - 1.29)/1.29)*100 = 50.39 %
BenchmarkDoubleStdSignBit            2.94 ns         1.94 ns    298666667
BenchmarkFloatSignBit                1.92 ns         1.40 ns    448000000  (( 2.20 - 1.40)/1.40)*100 = 57.14 %
BenchmarkFloatStdSignBit             2.99 ns         2.20 ns    298666667
BenchmarkDoubleFRExp                 3.26 ns         2.29 ns    320000000  (( 12.7 - 2.29)/2.29)*100 = 454.59 %
BenchmarkDoubleStdFRExp              16.9 ns         12.7 ns     89600000
BenchmarkFloatFRExp                  3.09 ns         2.02 ns    448000000  (( 13.5 - 2.02)/2.02)*100 = 568.32 %
BenchmarkFloatStdFRExp               19.7 ns         13.5 ns     49777778
BenchmarkDoubleModF                  3.67 ns         2.58 ns    344615385  (( 4.14 - 2.58)/2.58)*100 = 60.47 %
BenchmarkDoubleStdModF               6.09 ns         4.14 ns    248888889
BenchmarkFloatModF                   3.51 ns         2.26 ns    373333333  (( 3.35 - 2.26)/2.26)*100 = 48.23 %
BenchmarkFloatStdModF                5.09 ns         3.35 ns    186666667
BenchmarkDoubleMin                   1.70 ns         1.52 ns    576735632  (( 1.50 - 1.52)/1.52)*100 = -1.32 %
BenchmarkDoubleStdMin                1.78 ns         1.50 ns    407272727
BenchmarkFloatMin                    1.80 ns         1.26 ns    448000000  (( 1.33 - 1.26)/1.26)*100 = 5.56 %
BenchmarkFloatStdMin                 1.65 ns         1.33 ns    448000000
BenchmarkDoubleMax                   1.61 ns         1.39 ns    640000000  (( 1.37 - 1.39)/1.39)*100 = -1.43 %
BenchmarkDoubleStdMax                1.72 ns         1.37 ns    593798817
BenchmarkFloatMax                    1.68 ns         1.14 ns    560000000  (( 1.46 - 1.14)/1.14)*100 = 28.07 %
BenchmarkFloatStdMax                 1.86 ns         1.46 ns    373333333
BenchmarkDoubleTrunc                 1.72 ns         1.45 ns    344615385  (( 6.09 - 1.45)/1.45)*100 = 320 %
BenchmarkDoubleStdTrunc              8.13 ns         6.09 ns    100000000
BenchmarkFloatTrunc                  1.66 ns         1.38 ns    407272727  (( 1.31 - 1.38)/1.38)*100 = -5.07 %
BenchmarkFloatStdTrunc               1.63 ns         1.31 ns    560000000
BenchmarkDoubleFloor                 1.77 ns         1.33 ns    448000000  (( 1.09 - 1.33)/1.33)*100 = -18.04 %
BenchmarkDoubleStdFloor              1.49 ns         1.09 ns    560000000
BenchmarkFloatFloor                  1.67 ns         1.31 ns    560000000  (( 1.16 - 1.31)/1.31)*100 = -11.45 %
BenchmarkFloatStdFloor               1.69 ns         1.16 ns    497777778
BenchmarkDoubleCeil                  1.75 ns         1.23 ns    746666667  (( 1.40 - 1.23)/1.23)*100 = 13.82 %
BenchmarkDoubleStdCeil               1.87 ns         1.40 ns    560000000
BenchmarkFloatCeil                   1.82 ns         1.24 ns    669013333  (( 1.27 - 1.24)/1.24)*100 = 2.42 %
BenchmarkFloatStdCeil                2.00 ns         1.27 ns    640000000
BenchmarkDoubleRound                 1.78 ns         1.39 ns    640000000  (( 9.63 - 1.39)/1.39)*100 = 592.8 %
BenchmarkDoubleStdRound              11.9 ns         9.63 ns     74666667
BenchmarkFloatRound                  1.72 ns         1.28 ns    560000000  (( 2.09 - 1.28)/1.28)*100 = 63.28 %
BenchmarkFloatStdRound               2.47 ns         2.09 ns    373333333
BenchmarkDoubleClamp                 2.23 ns         1.46 ns    448000000  (( 1.67 - 1.46)/1.46)*100 = 14.38 %
BenchmarkDoubleStdClamp              2.04 ns         1.67 ns    746666667
BenchmarkFloatClamp                  2.09 ns         1.59 ns    560000000  equal
BenchmarkFloatStdClamp               2.41 ns         1.59 ns    344615385
BenchmarkDoubleInternalLerpImpl      3.43 ns         2.46 ns    280000000  (( 8.23 - 2.46)/2.46)*100 = 234.55 %
BenchmarkDoubleLerp                  8.67 ns         6.17 ns    111502223  (( 8.23 - 6.17)/6.17)*100 = 33.39 %
BenchmarkDoubleStdLerp               8.73 ns         8.23 ns    112000000
BenchmarkFloatInternalLerpImpl       3.24 ns         2.58 ns    224000000  (( 7.53 - 2.58)/2.58)*100 = 191.86 %
BenchmarkFloatLerp                   8.15 ns         7.53 ns    112000000  (( 7.53 - 7.53)/7.53)*100 = 0 %
BenchmarkFloatStdLerp                8.67 ns         7.53 ns    112000000
BenchmarkDoubleCopySign              1.75 ns         1.26 ns    497777778  (( 5.16 - 1.26)/1.26)*100 = 309.52 %
BenchmarkDoubleStdCopySign           6.30 ns         5.16 ns    112000000
BenchmarkFloatCopySign               1.69 ns         1.23 ns    407272727  (( 1.36 - 1.23)/1.23)*100 = 10.57 %
BenchmarkFloatStdCopySign            1.77 ns         1.36 ns    448000000
BenchmarkDoubleScaleByN              4.29 ns         3.68 ns    203636364  (( 7.15 - 3.68)/3.68)*100 = 94.29 %
BenchmarkDoubleStdScaleByN           10.4 ns         7.15 ns     89600000
BenchmarkFloatScaleByN               3.37 ns         2.62 ns    280000000  (( 7.26 - 2.62)/2.62)*100 = 177.1 %
BenchmarkFloatStdScaleByN            10.1 ns         7.26 ns     92490323
BenchmarkDoubleFMod                  9.71 ns         6.25 ns    100000000  (( 6.98 - 6.25)/6.25)*100 = 11.68 %
BenchmarkDoubleStdFMod               10.2 ns         6.98 ns     89600000
BenchmarkFloatFMod                   10.2 ns         8.54 ns     64000000  (( 9.00 - 8.54)/8.54)*100 = 5.39 %
BenchmarkFloatStdFMod                10.5 ns         9.00 ns     74666667
BenchmarkDoubleInternalExpImpl       4.10 ns         3.31 ns    179200000
BenchmarkDoubleExp                   4.38 ns         3.52 ns    248888889  (( 47.4 - 3.52)/3.52)*100 = 1246.59 %
BenchmarkDoubleStdExp                53.1 ns         47.4 ns     11200000
BenchmarkFloatInternalExpImpl        4.98 ns         3.81 ns    172307692
BenchmarkFloatExp                    4.75 ns         3.72 ns    172307692  (( 40.8 - 3.72)/3.72)*100 = 996.77 %
BenchmarkFloatStdExp                 53.0 ns         40.8 ns     14933333
BenchmarkDoubleInternalHypot         7.47 ns         5.16 ns    112000000
BenchmarkDoubleHypot                 1.79 ns         1.54 ns    640000000  (( 5.72 - 1.54)/1.54)*100 = 271.42 %
BenchmarkDoubleStdHypot              6.90 ns         5.72 ns    112000000
BenchmarkFloatInternalHypot          6.35 ns         5.72 ns    112000000
BenchmarkFloatHypot                  2.00 ns         1.51 ns    497777778  (( 5.00 - 1.51)/1.51)*100 = 231.13 %
BenchmarkFloatStdHypot               6.95 ns         5.00 ns    100000000
BenchmarkDoubleHypot3                3.28 ns         2.23 ns    280000000  (( 6.56 - 2.23)/2.23)*100 = 194.17 %
BenchmarkDoubleStdHypot3             7.20 ns         6.56 ns    112000000
BenchmarkFloatHypot3                 2.04 ns         1.46 ns    448000000  (( 6.26 - 1.46)/1.46)*100 = 328.77 %
BenchmarkFloatStdHypot3              6.81 ns         6.26 ns    172307692
BenchmarkDoubleInternalLog           7.62 ns         5.87 ns    138416552
BenchmarkDoubleLog                   7.05 ns         5.72 ns    112000000  (( 20.9 - 5.72)/5.72)*100 = 265.38 %
BenchmarkDoubleStdLog                29.8 ns         20.9 ns     34461538
BenchmarkFloatInternalLog            6.12 ns         5.16 ns    112000000
BenchmarkFloatLog                    5.79 ns         4.74 ns    112000000  (( 23.4 - 4.74)/4.74)*100 = 393.67 %
BenchmarkFloatStdLog                 29.3 ns         23.4 ns     28000000
BenchmarkDoubleInternalLog2          8.22 ns         6.00 ns    112000000
BenchmarkDoubleLog2                  8.36 ns         6.63 ns     89600000  (( 24.7 - 6.63)/6.63)*100 = 272.55 %
BenchmarkDoubleStdLog2               35.2 ns         24.7 ns     37333333
BenchmarkFloatInternalLog2           6.04 ns         4.19 ns    149333333
BenchmarkFloatLog2                   6.29 ns         5.62 ns    100000000  (( 20.1 - 5.62)/5.62)*100 = 257.65 %
BenchmarkFloatStdLog2                30.1 ns         20.1 ns     28000000
BenchmarkDoubleInternalLog10         8.37 ns         7.32 ns     89600000
BenchmarkDoubleLog10                 8.27 ns         6.28 ns    112000000  (( 28.9 - 6.28)/6.28)*100 = 360.19 %
BenchmarkDoubleStdLog10              32.5 ns         28.9 ns     24888889
BenchmarkFloatInternalLog10          6.59 ns         4.96 ns    154482759
BenchmarkFloatLog10                  6.31 ns         4.74 ns    112000000  (( 23.4 - 4.74)/4.74)*100 = 393.67 %
BenchmarkFloatStdLog10               29.9 ns         23.4 ns     32000000
BenchmarkDoubleInternalSin           10.7 ns         8.37 ns     74666667
BenchmarkDoubleSin                   7.43 ns         5.45 ns    152048486  (( 6.70 - 5.45)/5.45)*100 = 22.94 %
BenchmarkDoubleStdSin                7.73 ns         6.70 ns    112000000
BenchmarkFloatInternalSin            4.76 ns         4.01 ns    179200000
BenchmarkFloatSin                    4.81 ns         3.45 ns    248888889  (( 6.98 - 3.45)/3.45)*100 = 102.32 %
BenchmarkFloatStdSin                 7.75 ns         6.98 ns    112000000
BenchmarkDoubleInternalCos           9.63 ns         7.97 ns    100000000
BenchmarkDoubleCos                   8.62 ns         6.80 ns     89600000  (( 7.19 - 6.80)/6.80)*100 = 5.74 %
BenchmarkDoubleStdCos                8.53 ns         7.19 ns    100000000
BenchmarkFloatInternalCos            5.67 ns         4.17 ns    194782609
BenchmarkFloatCos                    4.68 ns         3.59 ns    208849115  (( 7.66 - 3.59)/3.59)*100 = 113.37 %
BenchmarkFloatStdCos                 8.58 ns         7.66 ns    100000000
BenchmarkDoubleInternalTan           17.2 ns         14.8 ns     56000000
BenchmarkDoubleTan                   12.3 ns         9.49 ns     56000000  (( 5.93 - 9.49)/9.49)*100 = -37.51 %
BenchmarkDoubleStdTan                8.39 ns         5.93 ns     89600000
BenchmarkFloatInternalTan            5.79 ns         4.88 ns    112000000
BenchmarkFloatTan                    5.73 ns         5.08 ns    160000000  (( 8.23 - 5.08)/5.08)*100 = 62.0 %
BenchmarkFloatStdTan                 9.75 ns         8.23 ns    112000000
BenchmarkDoubleInternalATan          8.41 ns         6.25 ns    100000000
BenchmarkDoubleATan                  8.17 ns         5.62 ns    100000000  (( 6.10 - 5.62)/5.62)*100 = 8.54 %
BenchmarkDoubleStdATan               8.59 ns         6.10 ns     89600000
BenchmarkFloatInternalATan           5.61 ns         3.48 ns    165925926
BenchmarkFloatATan                   5.89 ns         4.00 ns    160000000  (( 5.31 - 4.00)/4.00)*100 = 32.75 %
BenchmarkFloatStdATan                7.01 ns         5.31 ns    100000000
BenchmarkDoubleInternalASin          8.99 ns         6.28 ns     89600000
BenchmarkDoubleASin                  8.63 ns         5.78 ns    100000000  (( 6.98 - 5.78)/5.78)*100 = 20.76 %
BenchmarkDoubleStdASin               8.71 ns         6.98 ns     89600000
BenchmarkFloatInternalASin           4.69 ns         3.90 ns    172307692
BenchmarkFloatASin                   4.91 ns         3.61 ns    194782609  (( 4.74 - 3.61)/3.61)*100 = 31.30 %
BenchmarkFloatStdASin                7.07 ns         4.74 ns    112000000
BenchmarkDoubleInternalACos          7.37 ns         6.25 ns    100000000
BenchmarkDoubleACos                  7.15 ns         5.00 ns    100000000  (( 6.80 - 5.00)/5.00)*100 = 36.0 %
BenchmarkDoubleStdACos               8.74 ns         6.80 ns     89600000
BenchmarkFloatInternalACos           5.72 ns         5.16 ns    100000000
BenchmarkFloatACos                   5.39 ns         3.72 ns    192984615  (( 5.86 - 3.72)/3.72)*100 = 57.53 %
BenchmarkFloatStdACos                6.93 ns         5.86 ns    112000000
BenchmarkDoubleInternalATan2         12.7 ns         10.8 ns     89600000
BenchmarkDoubleATan2                 13.8 ns         12.2 ns     44800000  (( 15.1 - 12.2)/12.2)*100 = 23.77 %
BenchmarkDoubleStdATan2              20.9 ns         15.1 ns     37333333
BenchmarkFloatInternalATan2          10.4 ns         7.32 ns     74666667
BenchmarkFloatATan2                  9.75 ns         7.95 ns    112000000  (( 11.6 - 7.95)/7.95)*100 = 45.91 %
BenchmarkFloatStdATan2               13.4 ns         11.6 ns     49777778