Floating point proposal from left field... [long]

The section on floating-point values in the schema proposal seems to
have changed significantly a few times; might I propose one more
change?  The change is to unify the float and double types under one
floating-point type.  The floating-point type itself is left loosely
specified, but float and double are then described exactly through
two additional facets.  The number of significant digits is also
introduced as an FP facet.

I'm just a grad student in numerical computing, and I happen to like
the prospects of XML, so I've been reading some of the proposals more
relevant to my current work.  I have to deal with some file formats
that embed Fortran FORMAT statements to describe their data.  Yuck.
I'd love to replace them, but I need to convince the people above me
that XML offers more than what they currently have.  And I'm playing
with XML-RPC, so I followed the XML-RPC -> SOAP -> XML Schema chain to
get here.  I hope I'm not coming from too far out...

Note: I'm sticking to your use of the term `precision' for the number
of significant digits.  I'd _greatly_ prefer to see the terminology
changed, but that may be too much.  In my mind, `precision' is a
quality of a measurement with a definition that varies according to
how the measurement is taken.  I realize that to others, it's
something else.  However, the definition of ``significant digits'' is
clear to everyone.  _That's_ the reason why I believe `precision'
should be replaced by some randomly capitalized version of
``significant digits.''  Yes, I know what term SQL uses.  It's yet
another aspect of SQL that should not be emulated.

              ========================================

Proposal:  [this vaguely takes the form of entries in the draft, but
the language doesn't quite match]

Facets:

* bitsMantissa, bitsExponent:

Definition:  These facets restrict the value space of a
floating-point subtype by limiting the number of bits used to
represent the mantissa and the exponent, respectively.  Both must be
positive integers.
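
To make the intent concrete, here's a quick C sketch of the largest
finite value the two facets imply.  It assumes bitsMantissa counts
the mantissa's magnitude bits (implicit leading bit included) and
bitsExponent counts the exponent's magnitude bits, with both signs
carried separately; the function name is mine, not the draft's:

    #include <stdio.h>
    #include <math.h>

    /* Largest finite value in the space m * 2^e allowed by the two
       facets.  The counting convention (signs separate, implicit
       leading bit included) is my assumption, not the draft's. */
    double largest_value(int bitsMantissa, int bitsExponent)
    {
        double m_max = ldexp(1.0, bitsMantissa) - 1.0;  /* 2^p - 1 */
        int e_max = (1 << bitsExponent) - 1;            /* 2^q - 1 */
        return ldexp(m_max, e_max - bitsMantissa + 1);
    }

    int main(void)
    {
        printf("%G\n", largest_value(24, 7));   /* 3.40282E+38, FLT_MAX */
        printf("%G\n", largest_value(53, 10));  /* 1.79769E+308, DBL_MAX */
        return 0;
    }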

[Rationale:

This makes the definition of floating-point, float, and double exact
while retaining extensibility.  

There are efforts underway to provide broader support for wider
floating-point formats, and they are being well received in the
scientific computing community.  Also, the dominant hardware platform
(x86) provides extended precision support in hardware.  It would be
very nice to be able to communicate extended-precision numbers in a
standard manner.

However, this must be balanced with the difficulty of correctly
implementing extended-precision i/o.  The floating-point default
widths are specified below to be at least those of IEEE double.
Support for extended precision should remain a quality of
implementation issue for at least a few years, but the standard should
be written to accommodate more flexible implementations (imho).

It is possible to specify any FP number through a composite type of an
integral mantissa and an integral exponent with appropriate ranges on
both, but that's almost as obtuse as the FORMATs I want to exterminate.
]

* precision:

[add support for floating-point]

[Rationale:

This is how the majority of programmers and programs work with
floating-point i/o, so this should be included even if the bit-width
facets are not.  I'd imagine most people would be happy with a sloppy
floating-point type and only some way to specify the number of
significant digits.
]
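
As a sketch of what checking a literal against this facet could look
like; significant_digits is a name I made up, a real processor would
fold this into its literal parser, and I read the facet as an upper
bound on the count:

    #include <stdio.h>
    #include <ctype.h>

    /* Count significant digits in an FP literal:  leading zeros don't
       count, trailing zeros do.  Just a sketch of the facet check. */
    int significant_digits(const char *lit)
    {
        int n = 0, seen_nonzero = 0;
        for (; *lit && *lit != 'e' && *lit != 'E'; ++lit) {
            if (!isdigit((unsigned char) *lit))
                continue;                 /* skip signs and the '.' */
            if (*lit != '0')
                seen_nonzero = 1;
            if (seen_nonzero)
                ++n;
        }
        return n;
    }

    int main(void)
    {
        printf("%d\n", significant_digits("6.0221E23")); /* 5 */
        printf("%d\n", significant_digits("0.00310"));   /* 3 */
        return 0;
    }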

* {min,max}{Inclusive,Exclusive}

[same, essentially, plus...]

Note that these facets must be interpreted carefully in
floating-point.  The upper and lower bounds will be applied to the
value space defined by the most specific bitsMantissa and bitsExponent
facets.  If rounding is necessary, the inclusive bounds must be
rounded _away_ from the interval, while the exclusive bounds must be
rounded _into_ the interval.  The value not-a-number is implicitly
included in every range, except when an exclusive interval turns out
to be empty once applied to the relevant value space.
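
Here's a C sketch of the lower-bound half of that rule.  I use
strtold as a crude referee to see which side round-to-nearest landed
on; a real implementation would compare against the literal exactly,
and the upper-bound case just flips the directions:

    #include <math.h>
    #include <stdlib.h>

    /* Apply the rounding rule to a lower bound on IEEE doubles.  The
       long double referee is only a stand-in for an exact comparison
       against the literal. */
    double lower_bound(const char *lit, int inclusive)
    {
        double x = strtod(lit, 0);            /* round-to-nearest */
        long double exact = strtold(lit, 0);  /* crude referee */
        if ((long double) x == exact)
            return x;                         /* exactly representable */
        if (inclusive)       /* away from the interval: need x <= bound */
            return x < exact ? x : nextafter(x, -HUGE_VAL);
        else                 /* into the interval: need x >= bound */
            return x > exact ? x : nextafter(x, HUGE_VAL);
    }

    /* lower_bound("0.1", 1) gives the double just below 1/10, since
       strtod rounds the literal 0.1 up. */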

[Rationale:

This is difficult, but it's difficult even without the
bits{Mantissa,Exponent} facets.  While I like to consider myself at
least a little knowledgeable on FP issues, I'm not at all sure on the
subtleties of interval arithmetic.  See
   http://www.cs.utep.edu/interval-comp/main.html
for many resources, especially the papers at
   http://www.mscs.mu.edu/~globsol/walster-papers.html

The section of 
   http://www.mscs.mu.edu/~globsol/Papers/spec.ps 
on ``Interval constants'' (p17-18) suggests rounding the endpoints
_away_ from the interval in question, but they only deal with closed
intervals ({min,max}Inclusive).  It's the best reference I have, so I
followed its recommendation.  For exclusive bounds, however, the only
way to avoid including the boundary values is to round _into_ the
interval.

Whether not-a-number should be included is tricky.  While tests like
y <= NaN fail, the negated test !(y <= NaN) succeeds.  So if you
define a range [lb, ub] by lb <= x <= ub, NaN is not included.
Defining it by !(x < lb) && !(x > ub) includes NaN.
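
In C, on an IEEE-conforming implementation, the pair of tests looks
like this:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double y = 1.0, nan = NAN;     /* NAN is C99's quiet NaN */
        printf("%d\n", y <= nan);      /* 0: ordered comparisons fail */
        printf("%d\n", !(y <= nan));   /* 1: the negated test succeeds */
        return 0;
    }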

So should NaN be included?  Well, any number x can be pushed through a
series of operations to produce NaN, namely the function f(y) = (y -
x) / (y - x).  Any value from a non-empty range can be operated upon
to produce NaN, so NaN should be expected in any non-empty range.  Or
at least that's my shaky reasoning.  Whichever way you go, someone
will disagree, so go whichever way you feel is appropriate.  I'd
probably use one of the constructs proposed in the mailing list
archive to explicitly include it anyways.
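
The function above, in C; the volatile keeps the compiler from
folding the 0/0 away before run time:

    #include <stdio.h>

    int main(void)
    {
        volatile double x = 3.0;            /* any value works */
        double y = 3.0;
        printf("%G\n", (y - x) / (y - x));  /* 0/0 -> prints NAN */
        return 0;
    }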

The dance around `empty' intervals is bad, but I don't know what else
to do.  A type defining an empty interval doesn't allow any members.
Simply stating that such types are not defined may be an option, but
they can occur unexpectedly from rounding and exclusive ranges.  This
is true even without the bits* facets.

Those unfamiliar with the interpretation of NaN in the IEEE standard
may want to read
  http://www.cs.berkeley.edu:80/~wkahan/ieee754status/ieee754.ps
The rationale behind comparisons with NaNs is described in the INVALID
exception portion, p7-9.  I'd strongly recommend _not_ relying on the
Java documentation.  Java's floating-point support is, well, suspect.

Another good, general floating-point reference is ``What Every
Computer Scientist Should Know about Floating-Point Arithmetic'' by
David Goldberg, available under
  http://www.validgh.com/

Amazing amount of `rationale' for something which seems so obvious at
first.
]

* The floating-point type:

Definition: A floating-point value is either a discretized real number
or a special value.  The basic value space of floating-point consists
of the numeric values m * 2^e, where m and e are signed integers.  The
floating-point value space also contains the following special values:
positive and negative infinity and not-a-number.  The terms m and e
are named the mantissa and the exponent, respectively.

The floating-point type follows the IEEE standard but generalizes the
bit widths of the mantissa and exponent.  The mapping from literals to
values may require rounding.  Implementations must provide default
round-to-nearest behavior.  The default numbers of bits in the
mantissa's and exponent's representations are left to the
implementation, but they must be at least as wide as necessary to
represent the double subtype.  Implementations are encouraged to
support extended precisions.
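
To see the m * 2^e value space in action, here's how a C program can
recover the (m, e) pair behind a double; note that the literal 0.1
already picks up round-to-nearest on its way in:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double v = 0.1;                /* already rounded to nearest */
        int e;
        double frac = frexp(v, &e);    /* v = frac * 2^e, 0.5 <= frac < 1 */
        long long m = (long long) ldexp(frac, 53);  /* 53-bit integer */
        printf("%lld * 2^%d\n", m, e - 53);
        /* prints 7205759403792794 * 2^-56, slightly above 1/10 */
        return 0;
    }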

Lexical representation:  [same as presently in float & double, with
both NAN and INF either all-caps (C's %G) or all-lower (C's %g)]

Constraining facets of floating-point:

	* maxInclusive
	* maxExclusive
	* minInclusive
	* minExclusive
	* enumeration

	* bitsMantissa
	* bitsExponent
	* precision

[Rationale:  

Note that nothing is special about the mantissa of +/- 0.  The zero
simply inherits the sign from the mantissa.  

There is no language to require that implementations provide only
round-to-nearest.  I'd love to see one that handled user-specified
rounding modes properly...

Why not call it real?  Because the decimal type also describes real
values, as does the integer type.  I'm not proposing to unify the
types in a grand hierarchy, so it's best not to unify the names.  It'd
be nice to have such a hierarchy, but then it'd be expressed in Scheme.
]

* The float type:

Definition:  float corresponds to the IEEE single-precision float
type.  It is defined as a subtype of floating-point with the following
facet restrictions:
	bitsMantissa = 24
	bitsExponent = 7

* The double type:

Definition:  double corresponds to the IEEE double-precision float
type.  It is defined as a subtype of floating-point with the following
facet restrictions:
	bitsMantissa = 53
	bitsExponent = 10
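
For what it's worth, these values line up with C's <float.h> under
the counting convention I assumed earlier (signs carried separately):

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* FLT_MANT_DIG = 24, FLT_MAX_EXP - 1 = 127  = 2^7  - 1;
           DBL_MANT_DIG = 53, DBL_MAX_EXP - 1 = 1023 = 2^10 - 1. */
        printf("float:  %d, %d\n", FLT_MANT_DIG, FLT_MAX_EXP - 1);
        printf("double: %d, %d\n", DBL_MANT_DIG, DBL_MAX_EXP - 1);
        return 0;
    }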

              ========================================

This is significantly more than you probably want for the XML Schema
standard, but there are reasons for thinking ahead.  I hope I've
proposed something both feasible and useful.

The ability to express exactly the precision of floating-point numbers
stored in a data file could help XML's acceptance in scientific and
numeric computing.  Just as the decimal type is considered essential
in accounting, the ability to express FP precision and the number of
significant digits is considered essential in numerics.  Unfortunately
(?), we aren't the ones who issue paychecks.  ;)

Jason
(not subscribed, please cc)
