XML Schema Part 2: Datatypes; comment on the decimal datatype

XML Schema Part 2: Datatypes
W3C Candidate Recommendation 24 October 2000
Comment on decimal datatype [section 3.2.5]
============================================

IBM would like to request two small, but critical, changes to the XML
Schema decimal datatype description:

1. (Essential)

   Currently the scale of decimal numbers is restricted to be zero or
   positive.  It is requested that this restriction be removed (that is,
   in 2.4.2.11 the value of scale should be an integer, rather than a
   nonNegativeInteger) for the following reasons:

   a) The current specification allows the representation of very small
      numbers (for example 1E-100) but does not permit the efficient
      representation of even moderately large numbers (for example 13
      billion, or 13E+9), even though such numbers are common in
      commerce.  Allowing positive exponents (negative scales) will
      correct the specification so both large and small numbers can be
      represented equally efficiently.

   b) The current specification is only suitable for representing
      limited range, fixed point, decimal numbers.  Removing the
      restriction will make the representation general, and allow
      practical floating point operations on XML Schema decimal numbers.

   c) Removing the restriction will make conversions between the floating
      binary datatypes and the decimal datatype more efficient and less
      likely to raise exceptions.  For example, a binary floating point
      number approximates a number such as 1E+100 in a few bytes; when
      this is converted to XML Schema decimal it would require 101
      characters, which could exceed implementation limits.  However,
      an exact representation using an exponent (1E+100) requires only
      six characters (with only one digit of precision being needed,
      which would be within the capabilities of any implementation).
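
   As an illustration only, the effect of allowing a negative scale can
   be sketched with Python's decimal module, which is used here purely
   as a stand-in and is not part of XML Schema.  The helper name
   unscaled_and_scale is hypothetical; it assumes finite values and
   treats a number as unscaled * 10 ** (-scale).

```python
from decimal import Decimal


def unscaled_and_scale(text):
    """Split a finite decimal literal into (unscaled integer, scale).

    The value equals unscaled * 10 ** (-scale).  Hypothetical helper
    for illustration; NaN and Infinity are not handled.
    """
    sign, digits, exponent = Decimal(text).as_tuple()
    unscaled = int("".join(str(d) for d in digits))
    if sign:
        unscaled = -unscaled
    return unscaled, -exponent


# 13 billion needs a negative scale to keep a two-digit unscaled value.
big = unscaled_and_scale("13E+9")

# A very small number already fits with a positive scale.
small = unscaled_and_scale("1E-100")

# Without an exponent, 1E+100 has to be written out in full.
plain = format(Decimal("1E+100"), "f")

print(big, small, len(plain), len("1E+100"))
```

   Under these assumptions the large value is held as (13, -9) and the
   small one as (1, 100), while the plain form of 1E+100 is 101
   characters long against six for the exponent form.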

2. (Highly desirable)

   The lexical representation of decimal numbers (3.2.5.1) is currently
   restricted to be a subset of that of binary numbers.  It is proposed
   that the representation of decimal numbers be made the same as for
   binary numbers (3.2.3.1 for float, and 3.2.4.1 for double), for the
   following reasons:

   a) The current proposal has different lexical rules for binary and
      decimal numbers.  This distinction is an unnecessary complication
      which imposes a confusing and artificial syntax grammar; a single
      syntax will simplify the Schema and reduce implementation costs.

   b) At present, if a number is naturally expressed as a number with an
      exponent (for example, 1.3 billion might be written as 1.3E+9) the
      Schema requires that it be expanded to a plain integer form
      (1300000000) which is less efficient, hard to read, and
      error-prone.  This transformation also loses any conventional
      indication of significance.

   c) Similarly, small numbers (such as 1E-100) can be efficiently
      represented in the schema, yet to be written out they have to be
      expanded with leading zeros (one hundred, in this case).  This
      is inefficient, hard to read, and error-prone.

   d) It is usual to define a conversion between binary and decimal
      numbers as a conversion from binary to string form, and then from
      that string form to decimal (or vice versa).  If decimal numbers
      cannot be expressed using exponential form, these 'round trips'
      are impossible in one direction and impractical in the other.
      As specified, one lexical syntax of numbers defined by the XML
      Schema is incompatible with another.

   e) At present, exponential notation can only be used for binary
      floating point numbers.  Since these can only approximate many
      decimal fractions, the Schema has no mechanism for handling
      numbers in exponential notation precisely, other than as character
      strings.
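
   The points about significance, round trips, and exactness can be
   sketched as follows.  Python's decimal module is again only a
   stand-in for a decimal type that accepts exponential notation; the
   function name binary_to_decimal_via_string is hypothetical, and the
   use of repr() as the intermediate string form is an assumption of
   this sketch.

```python
from decimal import Decimal


def binary_to_decimal_via_string(x):
    """Convert a binary float to Decimal through its string form."""
    return Decimal(repr(x))


# (b) Expanding to plain digits loses the indication of significance:
# 1.30E+9 carries three significant digits, 1300000000 appears to carry ten.
with_exponent = Decimal("1.30E+9")
expanded = Decimal(format(with_exponent, "f"))
digits_before = len(with_exponent.as_tuple().digits)
digits_after = len(expanded.as_tuple().digits)

# (d) A round trip binary -> string -> decimal -> string -> binary works
# when both lexical forms accept an exponent.
d = binary_to_decimal_via_string(1e100)
round_trip_ok = float(str(d)) == 1e100

# (e) Binary floating point only approximates many decimal fractions,
# whereas a decimal type handles the same literals exactly.
binary_exact = (0.1 + 0.2 == 0.3)
decimal_exact = (Decimal("1E-1") + Decimal("2E-1") == Decimal("3E-1"))

print(digits_before, digits_after, round_trip_ok, binary_exact, decimal_exact)
```

   In this sketch the exponent form keeps three significant digits
   where the expanded form shows ten, the round trip succeeds, and only
   the decimal sum compares equal.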


Supporting information:

1. The Java class library has a BigDecimal class whose restrictions are
   similar to those of the proposed XML Schema.  This has proved to be so
   disadvantageous that numerous companies have supported IBM's request
   to remove the restriction.  For details, see the Java Specification
   Request JSR-13, at

     http://www2.hursley.ibm.com/decimalj/jsr-decimal.html

   which lists a representative selection of those companies, and
   expands on the rationale.  This JSR has been approved by Sun and is
   expected to result in an improved BigDecimal class in due course.

2. The W3C XForms working group would very much welcome full support for
   decimal datatypes in XML Schema.  Numbers are entered in decimal,
   displayed in decimal, and increasingly are stored in decimal.  Many
   users are confused when calculations (based upon binary arithmetic)
   deliver results that differ from those of the decimal arithmetic they
   were taught at school.
   The performance degradation incurred by using decimal instead of
   binary arithmetic is expected to be insignificant for forms-based
   applications, where much processing takes place in the client.

3. Programming languages and their libraries increasingly support
   floating point or wide-range decimal numbers.  These include Java
   (see above), COBOL, the Rexx family, C#, application libraries for C,
   C++, and Ada, and many scripting languages.

4. Decimal data is predominant in commercial databases, and arithmetic
   on these data often requires wide ranges, especially for large
   numbers.  One survey (partially reported in IBM Technical Report TR
   03.413 by A. Tsang & M. Olschanowsky) analyzed the column datatypes
   of databases owned by 51 major organizations.  These databases
   covered a wide range of applications, including Airline systems,
   Banking, Financial Analysis, Insurance, Inventory control, Management
   reporting, Marketing services, Order entry, Order processing,
   Pharmaceutical applications, and Retail sales.

   Of these columns, 41.8% contained identifiably numeric data; in
   these, the breakdown by datatype was:

     Type      | Columns  | percent
     ----------+----------+---------
     Decimal   |  251038  |   55.0
     SmallInt  |  120464  |   26.4
     Integer   |   78842  |   17.3
     Float     |    6180  |    1.4

   Since both SmallInt and Integer could have been represented by
   Decimal type numbers without loss, 98.6% of the numeric columns in
   the sample could have used a decimal representation.

5. For additional information on the direction and significance of
   decimal data and arithmetic, see http://www2.hursley.ibm.com/decimal
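
   The percentages quoted in item 4 can be rechecked from the column
   counts in the table.  This is a minimal sketch using Python's
   decimal module; the variable names and the choice of half-up
   rounding to one decimal place are assumptions of the sketch, not
   details taken from the survey.

```python
from decimal import Decimal, ROUND_HALF_UP

# Column counts copied from the table in item 4.
counts = {
    "Decimal": 251038,
    "SmallInt": 120464,
    "Integer": 78842,
    "Float": 6180,
}

total = sum(counts.values())
one_place = Decimal("0.1")


def percent(part):
    """Share of the numeric columns, rounded half-up to one decimal place."""
    return (Decimal(part) * 100 / Decimal(total)).quantize(
        one_place, rounding=ROUND_HALF_UP
    )


shares = {name: percent(n) for name, n in counts.items()}

# Everything except Float could be held in a decimal type without loss.
decimal_capable = percent(total - counts["Float"])

print(total, shares, decimal_capable)
```

   With these inputs the shares come out as 55.0, 26.4, 17.3, and 1.4,
   and the decimal-capable share as 98.6, matching the figures above.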


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Mike Cowlishaw FREng, IBM Fellow
mailto:mfc@uk.ibm.com  --  http://www2.hursley.ibm.com/mfcsumm.htm

Received on Wednesday, 8 November 2000 05:50:18 UTC