IEEE 754 Compliant Floating-Point Routines

ROUNDING METHODS

Truncation of a binary representation to n bits is severely biased since it always leads to a number whose magnitude is less than or equal to that of the exact value, thereby possibly causing significant error buildup during a long sequence of calculations. Simple adder-based rounding, which adds the NSB (next significant bit) to the LSB, is unbiased except when the value to be rounded is equidistant from the two nearest n-bit values[1]. In this case, magnitudes are always rounded up, thereby producing a small but still undesirable bias. This can be removed by stipulating that in the equidistant case, the n-bit value with LSB=0 is selected. This is commonly referred to as the round to the nearest method, the default mode in the IEEE754 standard[4,5]. The number of guard bits, or extra bits of precision, is related to the sensitivity of the rounding method, since using more guard bits results in fewer equidistant cases to be resolved. Since more than one guard bit requires an extra byte in PIC16/17 arithmetic, only one guard bit, usually handled in the carry bit, is employed in this library of floating point routines. Nearest neighbor rounding with one guard bit leads to the following simple result:

n Bit Value   Guard Bit   Result
A             0           round to A
A             1           if A,LSB=0, round to A
                          if A,LSB=1, round to A+1
A+1           0           round to A+1
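The rule in the table above can be sketched on a host computer as follows. This is only an illustration in Python, not the library's PIC16/17 code; the function names `round_nearest_one_guard` and `round_to_n_bits` are hypothetical. With a single guard bit, a set guard bit is always treated as the equidistant case.

```python
def round_nearest_one_guard(a: int, guard: int) -> int:
    """Round the n-bit magnitude `a` using one guard bit (hypothetical helper).

    guard == 0: the value is closest to a, so a is kept.
    guard == 1: treated as equidistant; pick the neighbor whose LSB is 0.
    """
    if guard == 0:
        return a
    if a & 1 == 0:
        return a          # A,LSB=0: round to A
    return a + 1          # A,LSB=1: round to A+1


def round_to_n_bits(value: int, total_bits: int, n: int) -> int:
    """Keep the top n bits of a total_bits-wide magnitude, using one guard bit."""
    shift = total_bits - n
    if shift <= 0:
        return value      # nothing to discard
    a = value >> shift                    # n-bit value A
    guard = (value >> (shift - 1)) & 1    # first discarded bit
    return round_nearest_one_guard(a, guard)
```

The result may carry out of n bits when A is all ones and is rounded up; a caller would then have to renormalize, which this sketch leaves out.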

Another interesting rounding method is Von Neumann rounding, or jamming, where the exact number is truncated to n bits and the LSB is then set to 1. Although the errors can be twice as large as in round to the nearest, it is unbiased and requires little more effort than truncation[1].
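As a rough host-side illustration of jamming, the sketch below compares it with plain truncation by averaging the signed error over every 12-bit magnitude reduced to 8 bits. The helper names and the 12-to-8-bit case are assumptions chosen for the example, not part of the library.

```python
def truncate(value: int, extra_bits: int) -> int:
    """Drop the low extra_bits bits."""
    return value >> extra_bits


def von_neumann_round(value: int, extra_bits: int) -> int:
    """Truncate, then force the LSB of the result to 1 (jamming)."""
    return (value >> extra_bits) | 1


def mean_error_lsb(rounder, n: int = 8, extra_bits: int = 4) -> float:
    """Mean signed error, in units of the n-bit LSB, over all inputs."""
    count = 1 << (n + extra_bits)
    total = 0
    for v in range(count):
        # Scale the rounded result back up so both sides share one grid.
        total += (rounder(v, extra_bits) << extra_bits) - v
    return total / count / (1 << extra_bits)
```

In this exhaustive case, truncation averages close to half an LSB low, while jamming leaves only a small residual that comes from the finite number of discarded bits. Jamming also never generates a carry, which is why it costs little more than truncation.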

FLOATING POINT FORMATS

In what follows, we use the following floating point formats:

                    eb         radix point   f0         f1         f2
IEEE754 32-bit      xxxxxxxx   .             Sxxxxxxx   xxxxxxxx   xxxxxxxx
truncated 24-bit    xxxxxxxx   .             Sxxxxxxx   xxxxxxxx

where eb is the biased 8-bit exponent, with bias = 2^7 = 128 = 0x80, S is the sign bit, and bytes f0, f1 and f2 constitute the mantissa, with f0 the most significant byte with implicit MSB = 1. It is important to note that the IEEE754 standard format[4] places the sign bit as the MSB of eb with the LSB of the exponent as the MSB of f0. Because of the inherent byte structure of the PIC16/17 family of microcontrollers, more efficient code was possible by adopting the above formats rather than strictly adhering to the IEEE standard.

The limiting absolute values of the above floating point formats are given as follows:

|A|   32-bit format   eb   e    f        decimal
MAX   0xFF7FFFFF      FF   7F   7FFFFF   1.7014117E+38
MIN   0x01000000      01   81   000000   2.9387359E-39
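The limits above can be checked on a host computer with a short decoding sketch. It assumes the value is read as a mantissa of the form 0.1xxx... (radix point to the left of the implicit MSB) scaled by 2^(eb - 128); that reading is inferred from the listed decimal limits rather than stated explicitly here, and the names `decode32` and `unbiased_exponent_byte` are hypothetical. Any special encodings, such as zero, are not handled.

```python
import math


def decode32(word: int) -> float:
    """Decode a 32-bit word laid out as eb, f0, f1, f2 (hypothetical helper)."""
    eb = (word >> 24) & 0xFF                  # biased exponent byte
    sign = (word >> 23) & 1                   # S occupies the MSB of f0
    mantissa = (word & 0x7FFFFF) | 0x800000   # restore the implicit MSB = 1
    # The 24-bit mantissa is a fraction in [0.5, 1), hence the extra -24.
    value = math.ldexp(mantissa, eb - 128 - 24)
    return -value if sign else value


def unbiased_exponent_byte(eb: int) -> int:
    """The 'e' column: eb with the bias removed, as an 8-bit two's complement byte."""
    return (eb - 0x80) & 0xFF
```

For example, `decode32(0xFF7FFFFF)` and `decode32(0x01000000)` can be compared against the MAX and MIN decimal entries in the table.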

where the MSB is implicitly equal to one, and its bit location is occupied by the sign bit. The 24-bit format has the same structure but with only a 16-bit mantissa. While 24- to 32-bit conversion is trivial, requiring only an additional zero byte in the mantissa, a 32- to 24-bit conversion routine would typically employ nearest neighbor rounding before truncation.
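A minimal host-side sketch of the two conversions follows. The function names are hypothetical, the MSB of f2 is used as the single guard bit, and raising `OverflowError` when rounding pushes the exponent past 0xFF is an assumption of this example, not documented library behavior.

```python
def widen_24_to_32(word24: int) -> int:
    """24- to 32-bit: append a zero byte (f2) to the mantissa."""
    return (word24 & 0xFFFFFF) << 8


def narrow_32_to_24(word32: int) -> int:
    """32- to 24-bit: nearest neighbor rounding with one guard bit, then truncation."""
    eb = (word32 >> 24) & 0xFF
    sign = (word32 >> 23) & 1
    mant = ((word32 >> 8) & 0x7FFF) | 0x8000   # f0, f1 with implicit MSB restored
    guard = (word32 >> 7) & 1                  # MSB of f2
    if guard and (mant & 1):                   # guard set and LSB=1: round up
        mant += 1
        if mant > 0xFFFF:                      # carry out of the mantissa
            mant >>= 1                         # 0x10000 becomes 0x8000
            eb += 1
            if eb > 0xFF:
                raise OverflowError("exponent overflow after rounding")
    return (eb << 16) | (sign << 15) | (mant & 0x7FFF)
```

Because only one guard bit is examined, any value whose guard bit is set is treated as equidistant, so `narrow_32_to_24(0x810000FF)` keeps the mantissa with LSB=0 rather than rounding up.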

To produce the correct representation of a particular decimal number, a high-level language compiler and debugger could be used to display the internal binary representation on a host computer and make the appropriate conversion to the above format. If this approach is not feasible, algorithms for producing this representation are contained in Appendix A.
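One possible host-side shortcut is sketched below. Rather than inspecting the host's IEEE bit pattern, it uses Python's `math.frexp`, which already returns a mantissa in [0.5, 1). This is not the Appendix A algorithm. The name `encode32` is hypothetical, and the sketch assumes the value is mantissa * 2^(eb - 128), that eb must lie in 0x01..0xFF, and that zero and non-finite inputs are simply rejected.

```python
import math


def encode32(x: float) -> int:
    """Encode a finite nonzero float as eb, f0, f1, f2 (hypothetical helper)."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("only finite nonzero values are handled in this sketch")
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))        # abs(x) == m * 2**e with 0.5 <= m < 1
    mant = round(m * (1 << 24))      # 24-bit mantissa; round() sends ties to even
    if mant == 1 << 24:              # rounding carried out of the mantissa
        mant >>= 1
        e += 1
    eb = e + 128                     # apply the bias of 0x80
    if not 1 <= eb <= 0xFF:
        raise OverflowError("exponent outside the assumed range 0x01..0xFF")
    # The implicit MSB is dropped and its position is taken by the sign bit.
    return (eb << 24) | (sign << 23) | (mant & 0x7FFFFF)
```

Under these assumptions, `hex(encode32(1.0))` gives `0x81000000`, since 1.0 is stored as 0.5 * 2^1.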

DS00575A-page 2

5-12

© 1994 Microchip Technology Inc.
