Re: Primitive Datatypes of XML Schema (boolean, float, double)

The attached note summarizes the reasons that persuaded the WG to shift from a
single abstract real-number datatype to the more concrete IEEE-based float and
double datatypes.

- Mark Reinhold
  Senior Staff Engineer                 901 San Antonio Road
  Core Java Platform Group              Palo Alto, CA 94303
  Java Software                         408-343-1830
  Sun Microsystems, Inc.                mr@eng.sun.com
		Floating-point datatypes are not real datatypes
			Mark Reinhold <mr@eng.sun.com>
				5 October 1999


The current "XML Schema: Datatypes" draft [1], including a proposed amendment
[2], contains facets that are intended to support the definition of generated
datatypes for floating-point number formats, such as those described by the
IEEE-754 standard, by refinement from the real-number datatype.  Floating-point
numbers are, however, but a rough model of the real numbers.  The fundamental
differences between these types of number systems render the facet-based
approach unworkable.  If the datatypes specification is to contain datatypes
for floating-point numbers, then it should define them so as to be completely
unrelated to the other numeric datatypes.


FLOATING-POINT NUMBERS ARE NOT REAL NUMBERS

Floating-point value spaces are fundamentally different from real and decimal
value spaces in, at least, the following ways:

  (1) The relationship between the sets of numbers in the floating-point and
      real value spaces is not trivial.

A binary floating-point value space cannot be defined in terms of the reals via
simple range constraints, via constraints on both mantissa and exponent
magnitude, or via constraints upon absolute values.  A faithful definition of a
particular floating-point value space in terms of the real numbers must
constrain the reals to values that can be expressed in the form m*b^e, where m
is a nonzero integer mantissa value within given bounds, b is a fixed positive
integer exponent base (typically a power of two), and e is an integer exponent
within given bounds.
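
As a rough illustration of this constraint, the following Python sketch tests
whether an exact rational number can be written in the form m*2^e.  The bounds
used here (|m| < 2^24 and -149 <= e <= 104, i.e., IEEE-754 single precision),
the function name, and the separate treatment of zero are assumptions made for
the example rather than part of any proposal.

```python
from fractions import Fraction

MANTISSA_LIMIT = 2 ** 24        # assumed bound: |m| < 2^24 (single precision)
MIN_EXP, MAX_EXP = -149, 104    # assumed exponent bounds for base b = 2


def is_single_precision_value(x: Fraction) -> bool:
    """Return True if x == m * 2**e for integers m, e within the bounds."""
    if x == 0:
        return True  # zero is handled separately in this sketch
    n, d = abs(x.numerator), x.denominator
    # The denominator must be a power of two, else no m * 2**e equals x.
    if d & (d - 1) != 0:
        return False
    # Write |x| as odd * 2**k.
    k = -(d.bit_length() - 1)
    while n % 2 == 0:
        n //= 2
        k += 1
    if k < MIN_EXP:
        return False  # would need an exponent below the minimum
    # The largest usable exponent gives the smallest possible mantissa.
    e = min(k, MAX_EXP)
    m = n * 2 ** (k - e)
    return m < MANTISSA_LIMIT


print(is_single_precision_value(Fraction(1, 10)))
print(is_single_precision_value(Fraction(13421773, 2 ** 27)))
```

Under these assumptions 1/10 is rejected, while 13421773 * 2^-27 (the
single-precision value nearest to 1/10) is accepted.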

Given this we could, in principle, derive a datatype for the IEEE-754
single-precision format from the real datatype by something like this, as
previously suggested by Olken and McCarthy [3]:

      <datatype name="ieee32">
        <basetype name="real"/>
        <exponentBase>2</exponentBase>
        <minExponent>-149</minExponent>
        <maxExponent>104</maxExponent>
        <minMantissa>-16777216</minMantissa>
        <maxMantissa>16777216</maxMantissa>
      </datatype>

While mathematically elegant, this approach is unlikely to be intuitive to, and
therefore unlikely to be used by, typical XML schema authors.  The five facets
shown here would most likely only be used in the definition of generated
datatypes within the schema specification and, perhaps, by schema experts.

Supporting these facets would, moreover, add considerable complexity to the
implementation of schema processors, which would have to be prepared to handle
any floating-point value space that can be described by these facets.  Paul
Biron has observed [4] that it is increasingly common for programming
environments to provide libraries that implement arbitrary-precision integer
and decimal arithmetic.  Arbitrary-precision floating-point arithmetic is,
however, another beast entirely and is far from common.  Programming
environments that support floating-point arithmetic are generally limited to
the capabilities of the underlying hardware.

  (2) Floating-point value spaces contain elements that do not belong
      in the real, decimal, or integer value spaces.

The IEEE floating-point formats, in particular, contain elements representing
+/-Inf, +/-0, and the NaN values.  No programming environment of which I'm
aware uses these values in decimal or integer computations.  These values
should, therefore, not be elements of the decimal or integer value spaces as
currently implied by productions 34 and 35 of the datatypes draft.  Neither
should these values be elements of the real value space, which is intended to
be a more faithful model of the real numbers and therefore has no need of
infinities, NaNs, or more than one zero.
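
A small Python sketch may make the mismatch concrete.  It uses the standard
float.as_integer_ratio() method to ask whether a floating-point value
corresponds to an exact ratio of integers; the helper name is hypothetical and
Python floats are assumed to be IEEE-754 doubles.

```python
import math


def has_exact_rational_value(x: float) -> bool:
    """Return True if the float x corresponds to an exact ratio of integers."""
    try:
        x.as_integer_ratio()
        return True
    except (OverflowError, ValueError):
        # OverflowError is raised for infinities, ValueError for NaN.
        return False


nan = float("nan")
neg_zero = -0.0

# Infinities and NaN have no counterpart among the rationals.
print(has_exact_rational_value(float("inf")))
print(has_exact_rational_value(nan))

# NaN is not even equal to itself.
print(nan == nan)

# The two zeros compare equal, yet their signs are distinguishable,
# and the sign is lost when converting to an integer ratio.
print(neg_zero == 0.0, math.copysign(1.0, neg_zero))
print(neg_zero.as_integer_ratio())
```

The infinities and NaN simply have no image in the integers or rationals, and
the distinction between the two zeros disappears on conversion.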

  (3) The mapping between floating-point lexical and value spaces is much more
      complex than in the decimal and integer cases.

The mapping from a string of digits and punctuation in one of the usual formats
to an arbitrary-precision exact internal form (e.g., java.math.BigDecimal) is
very simple because every number representable by such a string is concretely
representable in the internal form.  This is not the case for floating-point
numbers, where a program that parses number strings must carefully round up or
down to the floating-point value that most closely represents the intended
number [5].  This inherent approximation is why a datatype definition such as

      <datatype name="foo">
        <basetype name="ieee32"/>
        <maxInclusive>0.1</maxInclusive>
      </datatype>

admits instances whose values, when taken as real numbers, violate the range
constraint [6].  For example, an instance containing the number string
"0.1000000001" satisfies this datatype because a correctly-rounding parser
would round both "0.1" and "0.1000000001" up to the value
0.100000001490116119384765625, the
element of the IEEE single-precision value space that is closest to the real
numbers represented by these number strings.  If the base type in the above
example were decimal then this situation would not arise.
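
The effect can be reproduced with a short Python sketch.  The to_float32
helper is hypothetical; it rounds through a Python double before narrowing to
single precision, which I assume gives the same result as a direct, correctly
rounded conversion for these particular strings.

```python
import struct
from decimal import Decimal


def to_float32(text: str) -> float:
    """Parse text and round it to the nearest IEEE single-precision value.

    Assumption: rounding via a double first matches a direct string-to-single
    conversion for the inputs used in this example.
    """
    return struct.unpack("f", struct.pack("f", float(text)))[0]


bound = "0.1"
candidate = "0.1000000001"

# As exact decimals, the candidate exceeds the bound.
exceeds_as_decimal = Decimal(candidate) > Decimal(bound)

# As single-precision values, both strings map to the same element.
same_as_float32 = to_float32(candidate) == to_float32(bound)

# Exact decimal expansion of that shared single-precision value.
shared_value = Decimal(to_float32(bound))

print(exceeds_as_decimal, same_as_float32, shared_value)
```

With a decimal base type the comparison is exact and the candidate is
rejected; with a single-precision base type the two strings are
indistinguishable once parsed.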

These three points strongly suggest that any floating-point datatype(s) in the
datatypes specification should be completely divorced from the real, and hence
decimal, datatypes.  Deriving a floating-point datatype from the real datatype
would impose burdensome conceptual and implementation complexities (1).  A
floating-point datatype cannot be derived simply by constraining the real
datatype because the subtype must contain values that are not present in the
supertype (2).  Finally, the lexical representations of floating-point numbers
must be parsed and compared differently than those of reals or decimals (3).


A SIMPLE PROPOSAL

Given these conclusions I suggest the following simple approach to supporting
floating-point numbers in version 1.0 of the datatypes specification:

  (A) Introduce two new primitive base types, "float" and "double",
      corresponding to the IEEE-754 single- and double-precision formats,
      respectively.

I've used the names "float" and "double" intentionally here.  These names,
which are common to C, C++, Java, and other programming languages, seem much
more usable than the less familiar "ieee32" and "ieee64", which are moreover
difficult to speak and to type.

The value spaces of these datatypes should be defined precisely as IEEE-754
defines them, but for simplicity all the NaN values can be collapsed into a
single NaN value as in Java.  The lexical spaces should be defined to include
+/-Inf, NaN, and +/-0.  The mappings between the lexical and value spaces
should be specified to satisfy the value-preserving requirements outlined by
Steele and White [7], thereby ensuring repeatable and intuitive results for
common use cases such as those given above and by Layman [8].
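
One reading of the value-preserving requirement is that printing a value and
reading the result back must yield the original value.  The Python sketch
below checks that property under some assumptions: repr() is relied upon to
produce a round-tripping string for doubles, nine significant digits are used
for single precision, and the helper names are hypothetical.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def print_double(x: float) -> str:
    # repr() yields a short decimal string that reads back as exactly x.
    return repr(x)


def print_float32(x: float) -> str:
    # Nine significant digits suffice to identify a single-precision value.
    return format(x, ".9g")


def round_trips_double(x: float) -> bool:
    return float(print_double(x)) == x


def round_trips_float32(x: float) -> bool:
    # x is assumed to already be a single-precision value.
    return to_float32(float(print_float32(x))) == x


print(print_double(0.1), round_trips_double(0.1))
print(print_float32(to_float32(0.1)), round_trips_float32(to_float32(0.1)))
```

A specification of the lexical mappings along these lines would let a value
written by one conforming processor be recovered exactly by another.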

The float and double datatypes should not be related to any other types or even
to each other (see below).

  (B) Remove the +/-Inf and NaN literals and values from the lexical and value
      spaces of the decimal datatype and all derived datatypes.

As noted above these values are rarely, if ever, supported in actual
specifications or implementations of decimal or integer arithmetic.

  (C) Remove the real datatype.

This final change would leave decimal as a standalone primitive base type from
which integer, etc., are derived.  The real datatype would only remain
interesting if we were going to support non-decimal representations of real
numbers, e.g., the exact rational notation supported by Scheme [9].  Given that
we're not planning to do this, and that the floating-point types are no longer
being defined in terms of reals, the real datatype no longer serves any useful
purpose.


FLOATS ARE NOT DOUBLES

The float and double datatypes should not be related to any other types.  They
also should not be related to each other because the lexical-to-value mapping
is different for floating-point value spaces of different precisions.  For
example, the real number 1e-17 is most closely represented in the double value
space by

      6490371073168535 * 2^(966-1075)
	  == .000000000000000010000000000000000715424240546219245085082726...

and in the float value space by

      12089258 * 2^(70-150)
	  == .000000000000000009999999837751590242660576501876334987173322...

Since the number string "1e-17" (among many others) does not map to the same
value in these two value spaces it would be inconsistent to declare float to be
a subtype of double.  Doing so would violate the principle that if a string
maps to a given value in a particular type then it should map to the same value
in all supertypes.  This principle is not stated explicitly in the datatypes
specification, but it is fundamental to subtyping in programming languages.  If
it is violated in the XML Schema language then mappings from XML schemas to
common programming-language constructs will be made that much more cumbersome.
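
The two values quoted above can be checked with a few lines of Python.  This
is only a sketch: the to_float32 helper is hypothetical, and it narrows the
double nearest to 1e-17 rather than parsing the string directly into single
precision, which I assume makes no difference for this input.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


as_double = float("1e-17")
as_float = to_float32(as_double)

# Compare against the mantissa/exponent pairs given in the text.
double_matches = as_double == 6490371073168535 * 2.0 ** (966 - 1075)
float_matches = as_float == 12089258 * 2.0 ** (70 - 150)

# The same number string denotes different values in the two value spaces.
values_differ = as_double != as_float

print(double_matches, float_matches, values_differ)
```

Because the single-precision result lies below 1e-17 and the double-precision
result lies above it, no subtype relationship can make the string "1e-17"
denote one value in both types.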


CONCLUSION

The above proposal should be sufficient to make XML Schema v1.0 useful for a
wide variety of practical applications.  Due to the fundamental differences
between floating-point and other number systems described above, none of the
previously-proposed definitions of floating-point datatypes is tenable.  I
would prefer that XML Schema v1.0 omit floating-point datatypes entirely rather
than contain definitions that add significant conceptual and implementation
complexities and are inconsistent with common computational practice.

No primitive base types other than the IEEE-754 single- and double-precision
types are included in this proposal.  Because these are the only floating-point
formats for which implementations are widely available, the specification
should not require support for any others.  Doing so would place an undue
burden upon implementors of schema processors.  It may be useful, however, to
specify a few optional primitive base types for less common formats such as the
IEEE-754 quad-precision format and the legacy IBM hexadecimal format.  Each
such type would be optional in the sense that an implementor of a conforming
schema processor may choose to support it either according to the specification
or not at all.

This proposal is more draconian than the suggestions previously made by Olken
and McCarthy [3].  A point made in their conclusion, however, is well worth
repeating: We are not experts in floating-point arithmetic, so it is critical
that our final proposal be thoroughly reviewed by people who are.


REFERENCES

[1] XML Schema Part 2: Datatypes (W3C Working Draft 24 September 1999)
    http://www.w3.org/TR/1999/WD-xmlschema-2-19990924/

[2] Paul Biron: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0151.html

[3] Frank Olken and John McCarthy: real number specification in XML Schema
    http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jun/0120.html

[4] Paul Biron: Re: Bignums required for XML Schema?
    http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jun/0157.html

[5] William D. Clinger: How to Read Floating Point Numbers Accurately.
    In Proceedings of the Conference on Programming Language Design and
    Implementation, ACM, 1990, pp. 92-101.
    http://www.ccs.neu.edu/home/will/papers.html

[6] Mark Reinhold: Re: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0202.html

[7] Guy L. Steele Jr. and Jon L White: How to Print Floating-Point Numbers
    Accurately.  In Proceedings of the Conference on Programming Language
    Design and Implementation, ACM, 1990, pp. 112-126.

[8] Andrew Layman: Re: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0219.html

[9] Revised^5 Report on the Algorithmic Language Scheme: §6.2: Numbers
    http://www.schemers.org/Documents/Standards/R5RS/r5rs_49.html#SEC51

Received on Monday, 14 February 2000 12:48:37 UTC