From: Edward Jason Riedy <ejr@CS.Berkeley.EDU>

Date: Tue, 11 Jan 2000 00:20:08 -0800

Message-Id: <200001110820.AAA10026@lotus.CS.Berkeley.EDU>

To: www-xml-schema-comments@w3.org


The section on floating-point values in the schema proposal seems to have changed significantly a few times; might I propose one more change? The change is to unify the float and double types under one floating-point type. The floating-point type is a bit vague, but then float and double are concretely described through two additional facets. The number of significant figures is also introduced as a FP facet.

I'm just a grad student in numerical computing, and I happen to like the prospects of XML, so I've been reading some of the proposals more relevant to my current work. I have to deal with some file formats that embed Fortran FORMAT statements to describe their data. Yuck. I'd love to replace them, but I need to convince the people above me that XML offers more than what they currently have. And I'm playing with XML-RPC, so I followed the XML-RPC -> SOAP -> XML Schema chain to get here. I hope I'm not coming from too far out...

Note: I'm sticking to your use of the term `precision' for the number of significant digits. I'd _greatly_ prefer to see the terminology changed, but that may be too much. In my mind, `precision' is a quality of a measurement with a definition that varies according to how the measurement is taken. I realize that to others, it's something else. However, the definition of ``significant digits'' is clear to everyone. _That's_ the reason why I believe `precision' should be replaced by some randomly capitalized version of ``significant digits.'' Yes, I know what term SQL uses. It's yet another aspect of SQL that should not be emulated.

========================================

Proposal:

[this vaguely takes the form of entries in the draft, but the language doesn't quite match]

Facets:

* bitsMantissa, bitsExponent:

  Definition: These facets can restrict the number of bits necessary to represent the value space of a floating-point subtype. Both must be positive integers.
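To make the bitsMantissa facet a little more concrete, here is a minimal Python sketch (not part of the proposal) that counts how many mantissa bits a given double needs when it is written as m * 2^e with m an odd integer. The function names `mantissa_bits_needed` and `fits_mantissa` are hypothetical, and the sketch deliberately ignores bitsExponent, since the exponent range depends on how m is normalized.

```python
import math


def mantissa_bits_needed(x: float) -> int:
    """Bits in |m| when x is written as m * 2**e with m odd (0 for zero)."""
    if math.isnan(x) or math.isinf(x):
        raise ValueError("special values have no mantissa")
    # The ratio is in lowest terms and the denominator is a power of two.
    num, _den = x.as_integer_ratio()
    num = abs(num)
    # Integers such as 4.0 come back as (4, 1); strip the factors of two.
    while num and num % 2 == 0:
        num //= 2
    return num.bit_length()


def fits_mantissa(x: float, bits_mantissa: int) -> bool:
    """Hypothetical check of a value against a bitsMantissa restriction."""
    return mantissa_bits_needed(x) <= bits_mantissa
```

Under this reading, 1 + 2^-23 needs 24 bits and so fits a 24-bit mantissa, while 1 + 2^-24 needs 25 bits and does not.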
  [Rationale: This makes the definition of floating-point, float, and double exact while retaining extensibility. There are efforts underway to provide broader support for wider floating-point numbers, and they are being well-received in the scientific computing community. Also, the dominant hardware platform (x86) provides extended precision support in hardware. It would be very nice to be able to communicate extended-precision numbers in a standard manner.

  However, this must be balanced with the difficulty of correctly implementing extended-precision i/o. The floating-point default widths are specified below to be at least those of IEEE double. Support for extended precision should remain a quality-of-implementation issue for at least a few years, but the standard should be written to accommodate more flexible implementations (imho).

  It is possible to specify any FP number through a composite type of an integral mantissa and an integral exponent with appropriate ranges on both, but that's almost as obtuse as the FORMATs I want to exterminate. ]

* precision:

  [add support for floating-point]

  [Rationale: This is how the majority of programmers and programs work with floating-point i/o, so this should be included even if the bit-width facets are not. I'd imagine most people would be happy with a sloppy floating-point type and only some way to specify the number of significant digits. ]

* {min,max}{Inclusive,Exclusive}

  [same, essentially, plus...]

  Note that these facets must be interpreted carefully in floating-point. The upper and lower bounds will be applied to the value space defined by the most specific bitsMantissa and bitsExponent facets. If rounding is necessary, the inclusive bounds must be rounded _away_ from the interval, while the exclusive bounds must be rounded _into_ the interval. The value not-a-number is implicitly included in every range except when an exclusive interval is empty when applied to the correct value space.
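The rounding rule for bounds can be sketched in Python as follows. This is only an illustration under stated assumptions: the value space is taken to be that of the double subtype (Python's own float), the helper name `round_bound` is hypothetical, and literals beyond the double range are not handled. It relies on `math.nextafter`, available from Python 3.9.

```python
from fractions import Fraction
import math


def round_bound(literal: str, *, lower: bool, inclusive: bool) -> float:
    """Round a decimal bound literal onto the double value space.

    Inclusive bounds are rounded away from the interval and exclusive
    bounds into it, following the rule stated above.
    """
    exact = Fraction(literal)   # exact rational value of the literal
    nearest = float(exact)      # correctly rounded to the nearest double
    error = Fraction(nearest) - exact
    if error == 0:
        return nearest          # representable: no directed rounding needed
    # Lower-inclusive and upper-exclusive bounds must end up below the
    # exact value; the other two must end up above it.
    want_below = (lower == inclusive)
    if want_below and error > 0:
        return math.nextafter(nearest, -math.inf)
    if not want_below and error < 0:
        return math.nextafter(nearest, math.inf)
    return nearest
```

For the literal 0.1, whose nearest double lies slightly above one tenth, a lower inclusive bound moves down by one unit in the last place, while an upper inclusive bound stays at the nearest double.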
  [Rationale: This is difficult, but it's difficult even without the bits{Mantissa,Exponent} facets. While I like to consider myself at least a little knowledgeable on FP issues, I'm not at all sure about the subtleties of interval arithmetic. See
    http://www.cs.utep.edu/interval-comp/main.html
  for many resources, especially the papers at
    http://www.mscs.mu.edu/~globsol/walster-papers.html
  The section of
    http://www.mscs.mu.edu/~globsol/Papers/spec.ps
  on ``Interval constants'' (p17-18) suggests rounding the endpoints _away_ from the interval in question, but they only deal with closed intervals ({min,max}Inclusive). It's the best reference I have, so I followed its recommendation. For exclusive bounds, however, the only way to avoid including the boundary values is to round _into_ the interval.

  Whether not-a-number should be included is tricky. While tests like y <= NaN fail, y !<= NaN should succeed if it existed. So if you define a range [lb, ub] by lb <= x <= ub, NaN is not included. Defining it by x !< lb && x !> ub includes NaN. So should NaN be included? Well, any number x can be pushed through a series of operations to produce NaN, namely the function f(y) = (y - x) / (y - x), which yields 0/0 at y = x. Any value from a non-empty range can be operated upon to produce NaN, so NaN should be expected in any non-empty range. Or at least that's my shaky reasoning. Whichever way you go, someone will disagree, so go whichever way you feel is appropriate. I'd probably use one of the constructs proposed in the mailing list archive to explicitly include it anyway.

  The dance around `empty' intervals is bad, but I don't know what else to do. A type defining an empty interval doesn't allow any members. Simply stating that such types are not defined may be an option, but they can occur unexpectedly from rounding and exclusive ranges. This is true even without the bits* facets.
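  The two ways of defining a closed range can be compared in a small Python sketch. The function names are hypothetical. Python raises ZeroDivisionError for 0.0 / 0.0 instead of returning NaN, so the sketch obtains NaN from inf - inf rather than from f(y) itself.

```python
import math


def in_closed_range(x: float, lb: float, ub: float) -> bool:
    """Range defined positively: lb <= x <= ub. NaN fails both tests."""
    return lb <= x and x <= ub


def not_outside_range(x: float, lb: float, ub: float) -> bool:
    """Range defined by negated tests: not (x < lb) and not (x > ub).

    Every ordered comparison with NaN is false, so NaN passes.
    """
    return not (x < lb) and not (x > ub)


nan = math.inf - math.inf  # an invalid operation that yields NaN
```

  For ordinary numbers the two definitions agree; they differ only on NaN, which the first excludes and the second admits.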
  Those unfamiliar with the interpretation of NaN in the IEEE standard may want to read
    http://www.cs.berkeley.edu:80/~wkahan/ieee754status/ieee754.ps
  The rationale behind comparisons with NaNs is described in the INVALID exception portion, p7-9. I'd strongly recommend _not_ relying on the Java documentation. Java's floating-point support is, well, suspect. Another good, general floating-point reference is ``What Every Computer Scientist Should Know about Floating-Point Arithmetic'' by David Goldberg, available under
    http://www.validgh.com/

  Amazing amount of `rationale' for something which seems so obvious at first. ]

* The floating-point type:

  Definition: A floating-point value is either a discretized real number or a special value. The basic value space of floating-point consists of the numeric values m * 2^e, where m and e are signed integers. The floating-point value space also contains the following special values: positive and negative infinity and not-a-number. The terms m and e are named the mantissa and the exponent, respectively.

  The floating-point type follows the IEEE standard but generalizes the bit widths of the mantissa and exponent. The mapping from literals to values may require rounding. Implementations must provide default round-to-nearest behavior. The default numbers of bits in the mantissa's and exponent's representations are left to the implementation, but they must be at least as wide as necessary to represent the double subtype. Implementations are encouraged to support extended precisions.

  Lexical representation: [same as presently in float & double, with both NAN and INF either all-caps (C's %G) or all-lower (C's %g)]

  Constraining facets of floating-point:
    * maxInclusive
    * maxExclusive
    * minInclusive
    * minExclusive
    * enumeration
    * bitsMantissa
    * bitsExponent
    * precision

  [Rationale: Note that nothing is special about the mantissa of +/- 0. The zero simply inherits the sign from the mantissa.
  There is no language to require that implementations provide only round-to-nearest. I'd love to see one that handled user-specified rounding modes properly...

  Why not call it real? Because the decimal type also describes real values, as does the integer type. I'm not proposing to unify the types in a grand hierarchy, so it's best not to unify the names. It'd be nice to have such a hierarchy, but then it'd be expressed in Scheme. ]

* The float type:

  Definition: float corresponds to the IEEE single-precision float type. It is defined as a subtype of floating-point with the following facet restrictions:
    bitsMantissa = 24
    bitsExponent = 7

* The double type:

  Definition: double corresponds to the IEEE double-precision float type. It is defined as a subtype of floating-point with the following facet restrictions:
    bitsMantissa = 53
    bitsExponent = 10

========================================

This is significantly more than you probably want for the XML Schema standard, but there are reasons for thinking ahead. I hope I've proposed something both feasible and useful. The ability to express exactly the precision of floating-point numbers stored in a data file could help XML's acceptance in scientific and numeric computing. Just as the decimal type is considered essential in accounting, the ability to express FP precision and the number of significant digits is considered essential in numerics. Unfortunately (?), we aren't the ones who issue paychecks. ;)

Jason (not subscribed, please cc)

Received on Tuesday, 11 January 2000 03:20:09 UTC
