- From: Mark Reinhold <mr@eng.sun.com>
- Date: Mon, 14 Feb 2000 09:23:21 -0800
- To: Ofer Brandes <brandes@mintech.co.il>
- cc: connolly@w3.org
- Message-Id: <200002141743.JAA22389@engmail1.Eng.Sun.COM>
The attached note summarizes the reasons that persuaded the WG to shift from a single abstract real-number datatype to the more concrete IEEE-based float and double datatypes.

- Mark Reinhold
  Senior Staff Engineer, Core Java Platform Group
  Java Software, Sun Microsystems, Inc.
  901 San Antonio Road, Palo Alto, CA 94303
  408-343-1830
  mr@eng.sun.com
Floating-point datatypes are not real datatypes

Mark Reinhold <mr@eng.sun.com>
5 October 1999

The current "XML Schema: Datatypes" draft [1], including a proposed amendment [2], contains facets that are intended to support the definition of generated datatypes for floating-point number formats, such as those described by the IEEE-754 standard, by refinement from the real-number datatype. Floating-point numbers are, however, but a rough model of the real numbers. The fundamental differences between these types of number systems render the facet-based approach unworkable. If the datatypes specification is to contain datatypes for floating-point numbers then it should define them so as to be completely unrelated to the other numeric datatypes.

FLOATING-POINT NUMBERS ARE NOT REAL NUMBERS

Floating-point value spaces are fundamentally different from real and decimal value spaces in, at least, the following ways:

(1) The relationship between the sets of numbers in the floating-point and real value spaces is not trivial.

A binary floating-point value space cannot be defined in terms of the reals via simple range constraints, via constraints on both mantissa and exponent magnitude, or via constraints upon absolute values. A faithful definition of a particular floating-point value space in terms of the real numbers must constrain the reals to values that can be expressed in the form m*b^e, where m is a nonzero integer mantissa value within given bounds, b is a fixed positive integer exponent base (typically a power of two), and e is an integer exponent within given bounds.
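As a rough illustration of the m*b^e form with b = 2, the following Python sketch decomposes an IEEE single-precision value into an integer mantissa and exponent. The function names float32_parts and in_ieee32_value_space are hypothetical, and the bounds checked (a mantissa of magnitude below 2^24 and an exponent from -149 to 104) are the usual single-precision limits as I understand them, not something the draft defines.

```python
import struct
from fractions import Fraction


def float32_parts(x):
    """Return integers (m, e) such that the single-precision value nearest x equals m * 2**e."""
    # Round x to IEEE single precision and read back its 32 raw bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = -1 if bits >> 31 else 1
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if biased_exponent == 0xFF:
        raise ValueError("Inf and NaN cannot be written as m * 2**e")
    if biased_exponent == 0:
        # Zero and subnormals: no implicit leading bit, exponent pinned at -149.
        return sign * fraction, -149
    # Normal values: restore the implicit leading bit (2**23).
    return sign * (fraction | 0x800000), biased_exponent - 150


def in_ieee32_value_space(m, e):
    """Assumed single-precision bounds: |m| < 2**24 and -149 <= e <= 104."""
    return abs(m) < 2**24 and -149 <= e <= 104


m, e = float32_parts(0.1)
print(m, e)  # 13421773 -27
print(Fraction(m) * Fraction(2) ** e)  # 13421773/134217728
```

Note that Inf and NaN fall outside this form entirely, and that zero only fits if the mantissa is allowed to be zero.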
Given this we could, in principle, derive a datatype for the IEEE-754 single-precision format from the real datatype by something like this, as previously suggested by Olken and McCarthy [3]:

    <datatype name="ieee32">
      <basetype name="real"/>
      <exponentBase>2</exponentBase>
      <minExponent>-149</minExponent>
      <maxExponent>104</maxExponent>
      <minMantissa>-16777216</minMantissa>
      <maxMantissa>16777216</maxMantissa>
    </datatype>

While mathematically elegant, this approach is unlikely to be intuitive to, and therefore unlikely to be used by, typical XML schema authors. The five facets shown here would most likely only be used in the definition of generated datatypes within the schema specification and, perhaps, by schema experts.

Supporting these facets would, moreover, add considerable complexity to the implementation of schema processors, which would have to be prepared to handle any floating-point value space that can be described by these facets. Paul Biron has observed [4] that it is increasingly common for programming environments to provide libraries that implement arbitrary-precision integer and decimal arithmetic. Arbitrary-precision floating-point arithmetic is, however, another beast entirely and is far from common. Programming environments that support floating-point arithmetic are generally limited to the capabilities of the underlying hardware.

(2) Floating-point value spaces contain elements that do not belong in the real, decimal, or integer value spaces.

The IEEE floating-point formats, in particular, contain elements representing +/-Inf, +/-0, and the NaN values. No programming environment of which I'm aware uses these values in decimal or integer computations. These values should, therefore, not be elements of the decimal or integer value spaces as currently implied by productions 34 and 35 of the datatypes draft.
Neither should these values be elements of the real value space, which is intended to be a more faithful model of the real numbers and therefore has no need of infinities, NaNs, or more than one zero.

(3) The mapping between floating-point lexical and value spaces is much more complex than in the decimal and integer cases.

The mapping from a string of digits and punctuation in one of the usual formats to an arbitrary-precision exact internal form (e.g., java.math.BigDecimal) is very simple because every number representable by such a string is concretely representable in the internal form. This is not the case for floating-point numbers, where a program that parses number strings must carefully round up or down to the floating-point value that most closely represents the intended number [5].

This inherent approximation is why a datatype definition such as

    <datatype name="foo">
      <basetype name="ieee32"/>
      <maxInclusive>0.1</maxInclusive>
    </datatype>

admits instances whose values, when taken as real numbers, violate the range constraint [6]. An instance containing the number string "0.1000000001", e.g., satisfies this datatype because a correctly-rounding parser would round both "0.1" and "0.1000000001" up to the value 0.100000001490116119384765625, the element of the IEEE single-precision value space that is closest to the real numbers represented by these number strings. If the base type in the above example were decimal then this situation would not arise.

These three points strongly suggest that any floating-point datatype(s) in the datatypes specification should be completely divorced from the real, and hence decimal, datatypes. Deriving a floating-point datatype from the real datatype would impose burdensome conceptual and implementation complexities (1). A floating-point datatype cannot be derived simply by constraining the real datatype because the subtype must contain values that are not present in the supertype (2).
Finally, the lexical representations of floating-point numbers must be parsed and compared differently than those of reals or decimals (3).

A SIMPLE PROPOSAL

Given these conclusions I suggest the following simple approach to supporting floating-point numbers in version 1.0 of the datatypes specification:

(A) Introduce two new primitive base types, "float" and "double", corresponding to the IEEE-754 single- and double-precision formats, respectively.

I've used the names "float" and "double" intentionally here. These names, which are common to C, C++, Java, and other programming languages, seem much more usable than the less familiar "ieee32" and "ieee64", which are moreover difficult to speak and to type.

The value spaces of these datatypes should be defined precisely as IEEE-754 defines them, but for simplicity all the NaN values can be collapsed into a single NaN value as in Java. The lexical spaces should be defined to include +/-Inf, NaN, and +/-0.

The mappings between the lexical and value spaces should be specified to satisfy the value-preserving requirements outlined by Steele and White [7], thereby ensuring repeatable and intuitive results for common use cases such as those given above and by Layman [8].

The float and double datatypes should not be related to any other types or even to each other (see below).

(B) Remove the +/-Inf and NaN literals and values from the lexical and value spaces of the decimal datatype and all derived datatypes.

As noted above these values are rarely, if ever, supported in actual specifications or implementations of decimal or integer arithmetic.

(C) Remove the real datatype.

This final change would leave decimal as a standalone primitive base type from which integer, etc., are derived. The real datatype would only remain interesting if we were going to support non-decimal representations of real numbers, e.g., the exact rational notation supported by Scheme [9].
Given that we're not planning to do this, and that the floating-point types are no longer being defined in terms of reals, the real datatype no longer serves any useful purpose.

FLOATS ARE NOT DOUBLES

The float and double datatypes should not be related to any other types. They also should not be related to each other because the lexical-to-value mapping is different for floating-point value spaces of different precisions. The real number 1e-17, e.g., is most closely represented in the double value space by

    6490371073168535 * 2^(966-1075) == .000000000000000010000000000000000715424240546219245085082726...

and in the float value space by

    12089258 * 2^(70-150) == .000000000000000009999999837751590242660576501876334987173322...

Since the number string "1e-17" (among many others) does not map to the same value in these two value spaces it would be inconsistent to declare float to be a subtype of double. Doing so would violate the principle that if a string maps to a given value in a particular type then it should map to the same value in all supertypes. This principle is not stated explicitly in the datatypes specification, but it is fundamental to subtyping in programming languages. If it is violated in the XML Schema language then mappings from XML schemas to common programming-language constructs will be made that much more cumbersome.

CONCLUSION

The above proposal should be sufficient to make XML Schema v1.0 useful for a wide variety of practical applications. Due to the fundamental differences between floating-point and other number systems described above, none of the previously-proposed definitions of floating-point datatypes is tenable. I would prefer that XML Schema v1.0 omit floating-point datatypes entirely rather than contain definitions that add significant conceptual and implementation complexities and are inconsistent with common computational practice.
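The two rounding examples used in this note, the "0.1" range constraint and the "1e-17" comparison, can be checked with a short Python sketch. The helper to_float32 is a hypothetical name; it rounds through a double first, which happens to give the correctly rounded single-precision result for these inputs but is only an approximation of a correctly-rounding parser in general.

```python
import struct
from decimal import Decimal
from fractions import Fraction


def to_float32(text):
    """Parse a decimal string and round the result to IEEE single precision."""
    # Assumption: float() first rounds to double, then struct.pack rounds to
    # single. This two-step rounding agrees with direct rounding for the
    # inputs used here, but it is not a correctly-rounding parser in general.
    return struct.unpack(">f", struct.pack(">f", float(text)))[0]


# Both strings land on the same single-precision value, so a maxInclusive
# facet of 0.1 on a float type cannot tell them apart.
print(to_float32("0.1") == to_float32("0.1000000001"))  # True
print(Decimal(to_float32("0.1")))  # 0.100000001490116119384765625

# The same string denotes different exact values as a double and as a float.
as_double = Fraction(float("1e-17"))
as_float = Fraction(to_float32("1e-17"))
print(as_double == Fraction(6490371073168535, 2**109))  # True
print(as_float == Fraction(12089258, 2**80))  # True
print(as_double == as_float)  # False
```

Using Fraction and Decimal keeps every comparison exact, so the script does not introduce any rounding of its own beyond the two conversions being examined.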
No primitive base types other than the IEEE-754 single- and double-precision types are included in this proposal. Because these are the only floating-point formats for which implementations are widely available, the specification should not require support for any others. Doing so would place an undue burden upon implementors of schema processors. It may be useful, however, to specify a few optional primitive base types for less common formats such as the IEEE-754 quad-precision format and the legacy IBM hexadecimal format. Each such type would be optional in the sense that an implementor of a conforming schema processor may choose to support it either according to the specification or not at all.

This proposal is more draconian than the suggestions previously made by Olken and McCarthy [3]. A point made in their conclusion, however, is well worth repeating:

    We are not experts in floating-point arithmetic, so it is critical that our final proposal be thoroughly reviewed by people who are.

REFERENCES

[1] XML Schema Part 2: Datatypes (W3C Working Draft 24 September 1999)
    http://www.w3.org/TR/1999/WD-xmlschema-2-19990924/

[2] Paul Biron: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0151.html

[3] Frank Olken and John McCarthy: real number specification in XML Schema
    http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jun/0120.html

[4] Paul Biron: Re: Bignums required for XML Schema?
    http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jun/0157.html

[5] William D Clinger: How to Read Floating Point Numbers Accurately.
    In Proceedings of the Conference on Programming Language Design and Implementation, ACM, 1990, pp. 92-101.
    http://www.ccs.neu.edu/home/will/papers.html

[6] Mark Reinhold: Re: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0202.html

[7] Guy L. Steele Jr. and Jon L White: How to Print Floating-Point Numbers Accurately.
    In Proceedings of the Conference on Programming Language Design and Implementation, ACM, 1990, pp. 112-126.

[8] Andrew Layman: Re: real number datatype amendments
    http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Sep/0219.html

[9] Revised^5 Report on the Algorithmic Language Scheme: §6.2: Numbers
    http://www.schemers.org/Documents/Standards/R5RS/r5rs_49.html#SEC51
Received on Monday, 14 February 2000 12:48:37 UTC