Re: largest finite float

Well...just goes to show what happens when it's late and you don't go back
to original sources.  (Just think, if we'd done that originally, we might
not have gotten into this mess.)  OK:  I was tired.  It was late.  I don't
have an electronic copy of 754 and couldn't find my paper copy.

(BTW, Henry's msg calling me on this arrived while I was writing this one.
So at least I'd caught it myself.  For whatever that's worth.)

At 10:47 PM -0500 1/23/02, I wrote:
>At 4:58 PM -0500 1/23/02, wrote:

>>A small correction here - the maximum value of e in that formula is 
>>actually 104 (the minimum value is -149), so I believe the largest 
>>finite float value is 2^128 - 2^104, which is approximately 
>>3.4028x10^38.  (I'll refer to it as M below.)
>Almost.  Depending on how old that paper of mine was--there was a time when
>I was confused about how 754 approached this matter.

>     o	For normalized *positive* numbers, (since one bit is m's sign bit,
>	there are 23 left), this means

>	o    2**22 <= m <= 2**23 - 1
>     o	For e = 0, this would result in normalized positive numbers of the
>	form m * 2**e running from 2**22 to 2**23 - 1 .  Then varying e
>	equally on either side of 0 would bias things heavily in favor of
>	large numbers.
>     o	What they want is, for exponent zero, to have numbers close to and
>	just less than 1.  This requires that you bias the exponent by -23.
>     o	This means the number represented by m and e is (m * 2**e * 2**-23)
>Therefore the largest number representable is
>	(2**23 - 1) * 2**127 * 2**-23
>which is
>	(2**23 - 1) * 2**104,  AKA  (2**127 - 2**104)
>Net result is that Michael (and undoubtedly me, back then) didn't bias the
>exponent--and Michael, me back then, and Henry all failed to account for
>the sign bit.  I leave it to those with time and calculator packages to
>work out the decimal representation.

Well...I accounted for the sign bit.  But (because I didn't reread the
original and depended on stale memory) I didn't account for the trick of
hiding an extra bit, encoded into e:  Note that the top bit of m, except
for very small ("denormalized" or "subnormal") numbers, is always 1.
Why store it?  Its only function is to differentiate between normalized
and subnormal numbers, and subnormal numbers can be detected by the
exponent value.  Therefore we really have 25 bits for m:  One sign
bit, one implicit bit, and 23 stored bits, for a total of 24 non-sign
bits.  Hence  2**23 <= m <= 2**24 - 1 .  Rerun the calculations, with
(2**24 - 1) * 2**127 * 2**-23  =  (2**24 - 1) * 2**104 , and see that
Henry's value of  2**128 - 2**104  is correct.
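
For anyone with a Python interpreter handy, here is a minimal sketch that
checks this arithmetic.  It assumes the usual 754 single-precision layout
(1 sign bit, 8 exponent bits, 23 stored bits).  The helper name
max_finite_float32 is made up for illustration and does not come from 754
or from anyone's earlier message.

```python
import struct
from fractions import Fraction


def max_finite_float32() -> float:
    """Decode the largest finite single-precision bit pattern.

    0x7F7FFFFF is sign 0, biased exponent 254, all 23 stored bits set.
    (Assumes the standard IEEE 754 single format.)
    """
    return struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))[0]


# The formula from the message: m * 2**e * 2**-23, using the largest m
# (24 non-sign bits, implicit bit included) and the largest e.
m_max = 2**24 - 1
formula_value = m_max * 2**127 * Fraction(1, 2**23)

largest = max_finite_float32()

# Fraction(float) is exact, so these comparisons involve no rounding.
print(Fraction(largest) == formula_value)   # True
print(formula_value == 2**128 - 2**104)     # True
```

Exact rationals are used so that the comparison does not depend on how
the value happens to be printed in decimal.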

>Because of the bias, if you add more bits to the m (sorry, but it ain't a
>mantissa) you also increase the bias, no net gain.  To get larger numbers
>you must increase the exponent's bits.  Therefore, the next larger number
>would be
>	(2**22) * 2**(127 + 1) * 2**-23,  AKA  (2**127)

And that becomes  (2**23) * 2**(127 + 1) * 2**-23,  AKA  2**128 .
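
A rough way to see this, again assuming the standard 754 single layout:
the bit pattern one past the largest finite value decodes as infinity,
and the distance from the largest finite value up to 2**128 is exactly
one step of 2**104.  The helper float32_from_bits is a hypothetical name
used only for this sketch.

```python
import math
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit unsigned integer as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


largest = float32_from_bits(0x7F7FFFFF)       # largest finite pattern
next_pattern = float32_from_bits(0x7F800000)  # the very next bit pattern

# int() of a float is exact, so the gap is computed without rounding.
gap = 2**128 - int(largest)

print(gap == 2**104)             # True
print(math.isinf(next_pattern))  # True
```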

I think I've got it right this time.  Sigh....

Dave Peterson

Received on Thursday, 24 January 2002 10:45:52 UTC