Re: Question about number types from C. M. Sperberg-McQueen on 2008-07-04 (www-xml-schema-comments@w3.org from July to September 2008)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Fri, 4 Jul 2008 09:10:52 -0600
To: Alan Ruttenberg <alanruttenberg@gmail.com>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>, Dave Peterson <davep@iit.edu>, www-xml-schema-comments@w3.org
Message-Id: <8937A1F0-B5F0-4811-B2B1-66D0AAE1C154@acm.org>
On 2 Jul 2008, at 23:27 , Alan Ruttenberg wrote:

> On Jul 2, 2008, at 5:25 PM, Dave Peterson wrote:
>
>> ... But I think what you probably
>> want is to derive float and double from decimal.
> ...
>
>> The problem with that is that such a derivation would violate a
>> fundamental property that we wanted derivation to have:  If a value
>> is removed from the value space during a derivation, that
>> automatically removes its lexical representations from the lexical
>> space.  However, float and double require that strings that exactly
>> represent a decimal value not in the float or double value space be
>> mapped to the nearest value that is in the value space.
>>
>> Rather than remove that fundamental property of derivation, we
>> decided to leave float and double as separate primitives.
>
> Perhaps this is a stupid question, but why is this a fundamental  
> property of derivation? One generally thinks of types in terms of  
> subset relations.

Yes, indeed.

In XSD, datatypes can be viewed extensionally as a mapping from
lexical space to value space.  (In fact, "datatype" is the term
used in the XSD spec precisely for the extensional view, and
"simple type" or "simple type definition" for the intensional
view.)  For the primitive and ordinary datatypes
(i.e. for all datatypes except the special datatypes anySimpleType
and anyAtomicType) the lexical space is the domain and the value
space the range of the lexical mapping relation.
Restriction involves taking a subset of the base type; lexical facets
specify a subset of the lexical space, and cause corresponding subsets
of the mapping and value space to be generated, while value facets
specify a subset of the value space and cause corresponding subsets
of the mapping and lexical space to be generated.

The most economical way to think about it, although not the most
economical way to describe it so that people can actually use the
derivation mechanisms, is to consider all facets as filtering the
mapping relation (m' = l <: m for a lexical facet specifying a subset
l of the lexical space, or m' = m :> v for a value facet specifying
a subset v of the value space, if the operators <: and :> mean
anything to you).
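
For readers to whom those operators mean nothing, here is a minimal
sketch in Python.  It assumes a toy lexical mapping held as a set of
(literal, value) pairs; the literals, the two example facets, and the
function names domain_restrict and range_restrict are illustrative
inventions, not anything defined by the XSD spec.

```python
from decimal import Decimal

# Toy lexical mapping: a set of (literal, value) pairs.
# The literals are illustrative, not taken from the XSD spec.
m = {(lit, Decimal(lit)) for lit in ["1", "1.0", "01", "2", "2.50", "3"]}

def domain_restrict(l, m):
    """l <: m   keep the pairs whose literal is in l (a lexical facet)."""
    return {(lit, val) for (lit, val) in m if lit in l}

def range_restrict(m, v):
    """m :> v   keep the pairs whose value is in v (a value facet)."""
    return {(lit, val) for (lit, val) in m if val in v}

# A lexical facet: only literals written without a decimal point.
no_point = {lit for (lit, _) in m if "." not in lit}
m_lex = domain_restrict(no_point, m)

# A value facet: only values no greater than 1.
at_most_one = {val for (_, val) in m if val <= 1}
m_val = range_restrict(m, at_most_one)

# Either way, the restricted mapping is a subset of the base mapping.
assert m_lex <= m and m_val <= m
```

In this toy model the value facet keeps all three literals for the
value 1 ("1", "1.0", "01"), while the lexical facet drops "1.0" but
leaves the value 1 reachable through "1" and "01".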

Since the lexical mappings of float and double map literals to the
nearest value, while the lexical mappings for decimal and the
real type present in early drafts map literals to an exact value,
neither mapping appears to be plausible as a subset of the other.
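
A small illustration of the mismatch, as a sketch only:  it assumes
Python's float is an IEEE binary64 number (true on the usual
platforms) and uses it as a stand-in for the XSD double mapping, with
Decimal standing in for the exact decimal mapping.

```python
from decimal import Decimal
from fractions import Fraction

literal = "0.1"

# Exact mapping (decimal-style): the literal denotes exactly 1/10.
exact = Fraction(Decimal(literal))

# Nearest-value mapping (double-style): the literal denotes the
# binary64 number closest to 1/10, which is not 1/10 itself.
nearest = Fraction(float(literal))

print(exact)             # 1/10
print(nearest)           # 3602879701896397/36028797018963968
print(exact == nearest)  # False
```

The pair ("0.1", 1/10) belongs to the exact mapping but not to the
nearest-value mapping, and the pair the nearest-value mapping does
contain for "0.1" is absent from the exact one.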

No doubt other stories could be devised about how the lexical mapping
of a restriction relates to the lexical mapping of the base type.
But, as you say, the story that says "it's a subset" is simple
and appeals to fundamental intuitions about restrictions.  So XSD
has chosen that story.

There might be some way to tell that story and still get all the
numeric datatypes into a single derivation hierarchy, but I don't
know how to do that.

Another issue that arose in early drafts which attempted to derive
float and double from a real-number datatype:  the facets one must
define in order to describe the relation seem arbitrary and ad hoc,
lacking in any mathematical motivation, nearly incomprehensible
in fact, unless one asks what mathematical properties one must
exploit in order to represent approximations of real numbers in a
binary floating-point format designed for convenient representation
inside electronic devices.  It seemed simpler and more straightforward
to say that float and double are intended to match IEEE numbers
than to say that they are a particular subset of the reals defined
by application of particular facets.

Then, too, deriving float and double by specifying 2 as a base
and particular sizes for exponent and mantissa seems to suggest that
the same facets might be given different values, so as to make it
possible for schema authors to define a set of numbers which correspond
to a base-11 number with 17 digits of mantissa and 16 digits
of exponent.  (Or substitute any positive integers of your choice
for 17 and 16 here, and any integer greater than 1 for 11.)
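
To make concrete what such facets would have to pin down, here is a
hedged sketch of a membership test for a generalized floating-point
value space.  The function name and its four parameters are
hypothetical; no XSD draft is being quoted, and special values
(infinities, NaN, signed zero) are ignored.

```python
from fractions import Fraction

def representable(x, base, mantissa_digits, exp_min, exp_max):
    """True if x == m * base**e for some integers m and e with
    abs(m) < base**mantissa_digits and exp_min <= e <= exp_max.
    (Hypothetical 'facets'; zero is representable with m == 0.)"""
    x = Fraction(x)
    limit = base ** mantissa_digits
    for e in range(exp_min, exp_max + 1):
        m = x / Fraction(base) ** e
        if m.denominator == 1 and abs(m.numerator) < limit:
            return True
    return False

# Parameters resembling IEEE single precision: base 2, 24-bit
# significand, exponents scaled so that m is an integer.
ieee_like = dict(base=2, mantissa_digits=24, exp_min=-149, exp_max=104)

print(representable(Fraction(1, 2), **ieee_like))    # True
print(representable(Fraction(1, 10), **ieee_like))   # False
print(representable(2**24 + 1, **ieee_like))         # False

# The same 'facets' with other values: a small base-11 format.
print(representable(Fraction(1, 11), base=11, mantissa_digits=17,
                    exp_min=-5, exp_max=5))          # True
```

Nothing in the function privileges base 2 or the IEEE sizes, which is
exactly the generality that schema authors would then expect to be
able to use.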

The three designs available seemed to boil down to:

   - abstract numeric type with facets to allow definition of
     floating- and fixed-point numbers with arbitrary bases
     and capacities -- aka Implementors' Nightmare
   - abstract numeric type with facets for defining IEEE float
     and double, which however schema authors are forbidden to
     use, so the generality of the facet mechanism is purely
     illusory:  for all intents and purposes, the IEEE types
     are defined by magic, and the 'facets' are a fig leaf
   - primitives for the types actually to be supported, with
     provisions for type coercion in the languages which use
     them (as, for example, in the XPath Functions and Operators
     spec)

None of these seems to be so beautiful and obviously right that
everyone would greet it as the one true solution, but on the whole
I think the third approach, taken by XSD 1.0, is more honest and
straightforward, at least for the problems of validation that
XSD must solve.  As the XSD spec says, the mapping from XSD types
to types in a programming language or other system is not fixed,
and there is no requirement that XSD primitives map to primitives
in the other system, or vice versa.

> 2) That you inadvertently make the comparison emphasizes the point  
> that floats and decimals *are* comparable. When I said above that I  
> worry that the theory is not coherent, it is the absence of any  
> explanation within the specification of how such a comparison could  
> be made that forms part of such a concern.

Personally, I thought the spec was fairly clear that the
disjointness of the primitives is a given for purposes of XSD,
and is not intended as a constraint on other systems, which
will of course wish to compare values across primitive types.
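
As a sketch of what such a cross-type comparison might look like in
another system (Python here; the helper names equal_exact and
equal_promoted are made up for illustration, and Python's float is
assumed to be IEEE binary64):

```python
from decimal import Decimal
from fractions import Fraction

d = Decimal("0.1")   # stands in for a decimal value
f = 0.1              # stands in for a double value

def equal_exact(a, b):
    """Compare the two values as exact rational numbers."""
    return Fraction(a) == Fraction(b)

def equal_promoted(a, b):
    """Coerce both operands to double first, then compare."""
    return float(a) == float(b)

print(equal_exact(d, f))     # False
print(equal_promoted(d, f))  # True
```

Both answers are defensible; which one a system gives depends on the
coercion rule it adopts, and that choice lies outside what XSD itself
has to settle for validation.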

> ps. Please consider this a formal comment on the specification. If  
> desired I can submit it to the bug tracker.

Yes, please do.

When you do, it would be helpful if you clarified whether the gist
of your comment is

   (a) please reorganize your type hierarchy for numerics from the
       ground up
   (b) please say more explicitly whether it makes sense for
       applications and systems not performing XSD schema-validity
       assessment to compare values with different primitive types
   (c) multiple primitive numerics?  blecch!  yuck!

Speaking only for myself, I think (b) or something similar might be
plausible, but (a) is not likely to happen in a point release (or
for that matter in any spec claiming to define a version of XSD) and
(c) will elicit either a shrug or a sympathetic sigh, but probably not
a change to the spec.

As Michael Kay has said in this thread, there are a lot of interesting
issues here, and no one right answer.

--C. M. Sperberg-McQueen
   World Wide Web Consortium
Received on Friday, 4 July 2008 15:11:30 UTC