Fp64 software emulation

1/4/2024

The multiplication routine has the signature dblfloat mul_dblfloat (dblfloat x, dblfloat y), with the x component of a dblfloat holding the tail.

Note that in various applications, full double-float arithmetic may not be necessary. Instead one can use float computation, augmented by error compensating techniques, one of the oldest of which is the Kahan summation. I gave a brief overview of easily available literature on such methods in a recent posting in the NVIDIA developer forums. In the comments above, Robert Crovella also pointed to a GTC 2015 talk by Scott LeGrand, which I haven't had time to check out yet.

As for accuracy, double-float has a representational precision of 49 (24+24+1) bits, that is, the two 24-bit float significands plus one bit contributed by the sign of the tail, compared with IEEE-754 double which provides 53 bits.
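The Kahan summation mentioned above can be sketched in plain C as follows. This is only a minimal illustration, not code from the forum posting: the names kahan_sum and naive_sum are made up for this example, and it assumes strict IEEE 754 single-precision evaluation (in particular, no -ffast-math, which may optimize the compensation term away).

```c
#include <stddef.h>

/* Plain float accumulation, for comparison (hypothetical helper). */
float naive_sum(const float *values, size_t count)
{
    float sum = 0.0f;
    for (size_t i = 0; i < count; i++) {
        sum += values[i];
    }
    return sum;
}

/* Kahan (compensated) summation: carries the rounding error of each
 * addition forward in a separate compensation term. */
float kahan_sum(const float *values, size_t count)
{
    float sum = 0.0f;
    float comp = 0.0f;                 /* low-order bits lost so far */
    for (size_t i = 0; i < count; i++) {
        float y = values[i] - comp;    /* apply the pending correction */
        float t = sum + y;             /* low-order bits of y may be lost here */
        comp = (t - sum) - y;          /* recover what was just lost */
        sum = t;
    }
    return sum;
}
```

With an input such as 1.0f followed by several addends of 3e-8f, each small addend is below half an ulp of 1.0f, so the plain loop tends to drop them one by one, while the compensated loop accumulates them until they become large enough to affect the sum.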

To get an accurate count of the float instructions behind each double-float operation, you would want to inspect the binary code with cuobjdump --dump-sass. The double-float multiplication, mul_dblfloat, takes full advantage of FMA (fused multiply-add) support on the GPU. For double-float addition code, I would point you to a paper by Andrew Thall, as I do not have the time to code this up right now. From previous analysis I believe the addition code given in the paper is correct, and that it avoids common pitfalls in faster but less accurate implementations (which lose accuracy when the magnitude of the operands is within a factor of two).

If you are a registered CUDA developer you can download double-double code from NVIDIA's developer website (login required), which is under BSD license, and rework it relatively quickly into double-float code. NVIDIA's double-double code supports the operations addition, subtraction, division, square root, and reciprocal square root.

The multiplication requires 8 float instructions; unary negation is absorbed into FMA. The addition requires around 20 float instructions. However, the instruction sequences for double-float operations also require temporary variables, which increases register pressure and can decrease occupancy. A reasonably conservative estimate may therefore be that double-float arithmetic performs at 1/20 the throughput of native float arithmetic. You can easily measure this yourself, in the context relevant to you.

You can get a rough estimate of the performance by counting the number of float operations required to implement each double-float operation.

V1.1.20: new function for better math.
All fp64lib routines start with “fp64_”. The library is fully compatible with the usual “math.h” routines.

Conversion functions from and to float/double: fp64_sd(), fp64_ds().
Conversion functions from and to string: fp64_to_decimalExp(), fp64_strtod(), fp64_to_string().
Logarithmic and hyperbolic functions: fp64_log(), fp64_exp(), fp64_log10(), fp64_sinh(), fp64_cosh(), fp64_tanh(), fp64_ldexp(), fp64_frexp(), fp64_pow(), fp64_cbrt().

Furthermore, the library is mostly compatible with the avr_f64.c library. To make conversion from your previous avr_f64.c project easy, an avr_fp64.h header file is supplied that converts calls to avr_f64 routines into calls to fp64lib routines. Beware, however, that due to the extended support of IEEE 754, the behaviour of your program might differ slightly. But usually, you will only notice that your code runs 80% faster.

Source code of the library is available on GitHub; it can be downloaded as a plug & play Arduino library. The library reference can be found in the documentation section.

Fp64lib is a library for implementing 64-bit floating point arithmetic on the AVR MegaAVR architecture microprocessors, like the popular Arduino series.

Data format is fully compatible with the IEEE 754 binary64 standard (see Wikipedia):
Significand precision: 53 bits (52 explicitly stored).

To limit code size, not all features of IEEE 754 are implemented:
There are no signaling NaNs, only quiet NaNs.
Out of the five rounding modes, only round to nearest, ties to even is implemented. Therefore, rounding modes cannot be controlled.

The library comes with a math.h compatible header file named “fp64-math.h”.
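To make the binary64 layout more concrete, here is a small C sketch that splits a raw 64-bit pattern into its fields. It is not part of fp64lib: the type binary64_fields and the functions decode_binary64 and is_quiet_nan are names invented for this example. It relies only on the standard IEEE 754 binary64 layout (1 sign bit, 11 exponent bits biased by 1023, 52 stored significand bits) and takes the pattern as a uint64_t so that it does not depend on how wide double is on a given target.

```c
#include <stdint.h>

/* Fields of an IEEE 754 binary64 value (hypothetical helper type). */
typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 11 bits, biased by 1023 */
    uint64_t fraction;   /* 52 explicitly stored significand bits */
} binary64_fields;

/* Split a raw 64-bit pattern into sign, exponent and fraction. */
binary64_fields decode_binary64(uint64_t bits)
{
    binary64_fields f;
    f.sign = (unsigned)(bits >> 63);
    f.exponent = (unsigned)((bits >> 52) & 0x7FFu);
    f.fraction = bits & ((UINT64_C(1) << 52) - 1);
    return f;
}

/* A quiet NaN has all exponent bits set and the top fraction bit set. */
int is_quiet_nan(uint64_t bits)
{
    binary64_fields f = decode_binary64(bits);
    return f.exponent == 0x7FFu && ((f.fraction >> 51) & 1u) != 0;
}
```

The 53rd significand bit is the implicit leading 1 of normalized numbers, which is why only 52 bits appear in the fraction field.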
