Posts Tagged floating-point

What’s wrong with floating-point?

Every so often, floating-point data types come under renewed criticism. There are plenty of good reasons to whisper behind their backs, but are they really all bad?

The basic problem is that floating-point types represent any value as an approximation. It works really well too. In fact, for most whole numbers that you’re likely to encounter the approximation is exact, so 1 is always 1 and 10124 is always 10124.

It gets messy in two scenarios. First, when the value is greater than the precision for your floating point type you start losing least significant digits. The precision for Double is about sixteen digits, so this is the kind of problems you run into:

MyVal := 12345678901234567; // Seventeen digits
if MyVal = 12345678901234567 then
  ShowMessage('This doesn'' show')
else
  ShowMessage('But this does');

Note that Double can represent far larger numbers than this, but only at this 53 bit precision. Integer on the other hand has a fixed upper value (MaxInt) after which it wraps around and your huge number is suddenly very small. Try it:

MyVal := MaxInt + 1;
if MyVal = -MaxInt-1 then
 ShowMessage('Where''d my big number go?');

The second problem – and the one I believe is more common – arises when working with fractions. The problem here is that not all fractions can be represented in a finite number of digits. Working with decimal (base 10) numbers, you can immediately point to 1√∑3. This hands you a value that repeats forever and will never be completely accurate, no matter how many threes you add to the end of it. The same happens in binary systems and on numbers you don’t expect – for example, the decimal value 0.1 repeats forever when you convert it to binary. And that means that the approximation is off and all calculations using that value are inaccurate.

It works often enough that many programmers get away with code that compares floating point values using the equality operators (=, <, >). In fact, when you round it for display or storage the value is normally spot-on what you expect it to be. Things go well until you compare two values that should be the same but were calculated in different ways. To remedy this, one should always compare values within ranges, like so:

if abs(x - y) < 0.0001 then
  DoSomething();

The above is a little unintuitive, and Delphi now sports a couple of functions to make it all look pretty:

function CompareValue(const A: Double; const B: Double; Epsilon: Double = 0): TValueRelationship; overload;
function IsZero(const A: Double; Epsilon: Double = 0): Boolean; overload;
function SameValue(const A: Double; const B: Double; Epsilon: Double = 0): Boolean; overload;

Each of these have a few overloads to safely work with Single and Extended values as well. Think of these as you would of utility functions that compare the contents of objects – a necessary way of dealing with these data types. Direct use of the equality operators should be banned for safety’s sake.

So, how about the alternatives? The advice most often given is to store real values as integers, fractions or BCD values.

Integers can be used to implement a fixed-point number, but pose a problem in their limited scale and ease of use. If you’re going to work with fewer that ten digits in total, an Integer could work just fine. As long as you remember to consistently multiply, divide and round as needed. See the problem? For most values that this works for, plain old Double will probably give you results that are at least as good with far less greying of the hair. You could of course wrap this in a record with some overloaded operators, but I have personally not had the need.

Currency is a special Integer case, which is implemented as a 64 bit signed Integer with four places after the decimal point.

A lot of the inaccuracy with floating-point arithmetic arise when dividing two integer values. Popular solution to this? Just keep the original integer values in a record structure. You know what this means don’t you? Even more maintenance work than the fixed-point integer solution listed above. You need to make sure when adding and subtracting these numbers that you work with a common denominator and probably write some code to simplify your fractions. To make matters worse this can only store rational numbers, so PI will be a very poor approximation. Again, this has never seemed like a solution worth implementing to me.

Finally, there is BCD. Every four binary fits store a single decimal digit, so four bits can store up to decimal¬† nine, eight up to ninety-nine and so on. BCD can store huge numbers. No, even bigger than that. In the Delphi implementation, a total of sixty-four digits. That’s a lot. The tradeoff is memory footprint and also some performance penalty because every BCD value takes a whopping 34 bytes of memory, which could be a lot if you’re passing them on the stack as parameters to functions. Moreover, BCD isn’t a pleasure to work with natively – you need special functions to add them, multiply them, even assign them. It gets a little easier (and still less efficient) if you declare and manipulate them as Variants. Check out the FmtBcd unit for info on working in high precision.

Personally, I use Double as my workhorse type for real numbers with Extended as my backup. They’re more than accurate enough for nearly all scenarios, give adequate performance and use clean, simple syntax. Life is Zen as long as you use SameValue, CompareValue and IsZero religiously. I reserve BCD for cases where exceptional many-decimal accuracy is a must and use Integer for integers only.

What heuristics do you apply?

Advertisements

Leave a Comment