Re: Canonical Representation for decimal from MFC@uk.ibm.com on 2000-12-13 (www-xml-schema-comments@w3.org from October to December 2000)

From: <MFC@uk.ibm.com>
Date: Wed, 13 Dec 2000 11:40:05 +0000
To: www-xml-schema-comments@w3.org
Message-ID: <802569B4.0043AFF8.00@d06mta10.portsmouth.uk.ibm.com>

The proposed canonical representation would introduce a serious problem
into the definition for decimal.

Recall that the value space of decimal is the set of the values i * 10^-n,
(where i and n are integers, * means multiply, ^ means raise to power,
and currently n must be non-negative).

In today's usage, in both databases and programming languages, decimal
numbers are almost always represented in precisely that manner, that is,
as an integer (i) and a scale (n).  The scale may be implicit or
explicit, depending on the language or database.

As defined, and in actual representations, therefore, the value space
of decimal can contain multiple distinct values which are numerically
equal.  This is an important attribute of the representation, especially
useful in financial and engineering contexts.  For example, the two
values:

   1 * 10^-0
 100 * 10^-2

are distinct (the first is normally written as '1', the second as
'1.00').  They have different integer and scale parts.
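
As a rough illustration (Python's decimal module is my choice of
example here, not something the draft refers to), the two values above
compare as numerically equal while keeping different integer and scale
parts:

```python
from decimal import Decimal

a = Decimal("1")      # integer 1,   scale 0
b = Decimal("1.00")   # integer 100, scale 2

# Numerically equal...
print(a == b)                  # True

# ...but the stored digits and exponent (the negated scale) differ.
print(a.as_tuple())
print(b.as_tuple())

# Converting back to a string keeps the distinction.
print(str(a), str(b))          # 1 1.00
```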

Now, the canonical lexical representation should be a set of literals
such that there is a one-to-one mapping between literals in the
canonical lexical representation and values in the value space.

The proposed new wording does not satisfy this definition; it would
show both the distinct values above as the same literal ('1.0'), whereas
in fact the only value that can correctly be shown in this way is:

  10 * 10^-1

In other words, the proposed canonical representation would lose
information; if a decimal number (such as the second example, 100 with a
scale of 2) were encoded using the proposed canonical representation
then its original form could not be recovered as there is not a
one-to-one mapping.
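
The loss can be sketched as follows.  The function lossy_canonical is a
hypothetical stand-in, not the proposal's wording: it assumes only that
the literal depends on the numeric value alone (trailing zeros removed,
at least one fraction digit kept), which is enough to show the
collision.

```python
def lossy_canonical(i: int, n: int) -> str:
    """Hypothetical stand-in for a value-only canonical form.

    Assumption: trailing zeros are dropped and at least one digit is
    kept after the decimal point, so the original scale is discarded.
    """
    # Reduce to the smallest scale that still represents the value.
    while n > 0 and i % 10 == 0:
        i //= 10
        n -= 1
    # Force at least one fraction digit.
    if n == 0:
        i *= 10
        n = 1
    sign = "-" if i < 0 else ""
    digits = str(abs(i)).rjust(n + 1, "0")
    return f"{sign}{digits[:-n]}.{digits[-n:]}"


# Three distinct (integer, scale) pairs collapse to a single literal.
pairs = [(1, 0), (100, 2), (10, 1)]
literals = {lossy_canonical(i, n) for i, n in pairs}
print(literals)   # {'1.0'}
```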

- - - - -

One unambiguous canonical representation would be to use an exponential
notation matching the value space (that is, for the three examples
above: 1, 100E-2, and 10E-1).  However, the current draft prefers plain
numbers (as do most people and programming languages when the range is
small).  For plain numbers, the words used are typically something like
(using XML-schema naming):

  The absolute value of the integer (i) is first converted to a string
  in base ten using the characters '0' through '9' with no leading zeros
  (except if its value is zero, in which case a single '0' character is
  used).

  If the scale (n) is zero then no decimal point is added.

  Otherwise (the scale is positive), a decimal point will be inserted
  into the converted integer with the value of the scale specifying the
  number of characters to the right of the decimal point.  '0'
  characters are added, to the left of the converted integer, if
  necessary to allow this insertion.  If no character precedes the
  decimal point after the insertion then a conventional '0' character is
  prefixed.

  Finally, if the integer (i) was less than 0 then the entire string is
  prefixed by a minus sign character.
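
The steps above can be sketched in a few lines.  This is only a minimal
illustration under my reading of the wording; the name
canonical_decimal and the choice of Python are assumptions, and the
scale is taken to be non-negative as in the current value space.

```python
def canonical_decimal(i: int, n: int) -> str:
    """Render the value i * 10^-n as a plain-number literal."""
    if n < 0:
        raise ValueError("scale must be non-negative")

    # Absolute value of the integer in base ten, no leading zeros
    # (str() yields a single '0' when the value is zero).
    text = str(abs(i))

    if n > 0:
        # Add '0' characters on the left, if needed, so that n
        # characters can sit to the right of the decimal point.
        text = text.rjust(n, "0")
        whole, frac = text[:-n], text[-n:]
        # Prefix a conventional '0' if nothing precedes the point.
        if not whole:
            whole = "0"
        text = whole + "." + frac

    # Finally, prefix a minus sign if the integer was negative.
    if i < 0:
        text = "-" + text
    return text


for i, n in [(1, 0), (100, 2), (10, 1), (5, 3), (-5, 1)]:
    print((i, n), "->", canonical_decimal(i, n))
```

With this sketch the three example values come out as '1', '1.00' and
'1.0', so each keeps its own literal.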

This definition preserves the one-to-one mapping and also meets the
requirements for lexical space (section 2.3), notably that the literals
should correspond to those found in common programming languages and
libraries.
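
One way to see the one-to-one property is that a literal of this form
can be read back into exactly one integer and scale.  The helper
parse_canonical below is a hypothetical name used only for this sketch,
and it assumes its input is already a well-formed plain-number literal.

```python
from typing import Tuple


def parse_canonical(literal: str) -> Tuple[int, int]:
    """Recover (integer, scale) from a plain-number literal.

    Assumes a well-formed literal: optional '-', digits, and an
    optional '.' followed by one or more digits.
    """
    negative = literal.startswith("-")
    body = literal[1:] if negative else literal
    whole, _, frac = body.partition(".")
    # The scale is simply the number of digits after the point.
    integer = int(whole + frac)
    return (-integer if negative else integer, len(frac))


for literal in ["1", "1.00", "1.0", "0.005", "-0.5"]:
    print(literal, "->", parse_canonical(literal))
```

Each of '1', '1.00' and '1.0' yields a different pair, (1, 0), (100, 2)
and (10, 1) respectively, so no information about the scale is lost.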

I would suggest that the canonical representation should follow this
definition (it's also perhaps more understandable if a positive
description of the tighter definition is given, rather than trying to
derive a tight definition by prohibiting aspects of a vague definition).

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Mike Cowlishaw, IBM Fellow
mailto:mfc@uk.ibm.com  --  http://www2.hursley.ibm.com/decimal
Received on Wednesday, 13 December 2000 07:20:16 UTC