Data types again

I have revised my paper on data typing at 

[1] http://www.textuality.com/xml/typing.html

to take into account some of the issues raised here.  The following
changes have been made: 

 1. scientific notation is supported for FLOAT data types
 2. the DATE/TIME/TIMESTAMP types conform to ISO 8601:1988 (thanks to
    Peter M-R for digging this up)
 3. rather than overload everything on XML-SQLSIZE, there are several
    independent data type parameterization attributes
 4. used the nice explanatory technique from [2] below of
    providing element-like declarations for each type, to illustrate
    the use of the various parameterizing attributes
 5. more examples

Another note that is relevant is 

[2] Jon Bosak's legible version of Jon Paoli's posting, dated 16/05/97,
    embodying a Microsoft proposal, entitled "SD3 - Data Types [fmt]"

The following issues need to be decided by the ERB - most of them have
been sufficiently discussed I think, but it can't hurt to lay them out.

DT-1. Should we propose a data typing mechanism as part of the XML work?
 Pro: the whole SGML world has been screaming for this for years, and
      XML needs it even more.
 Con: less is more - the world, despite its screams, has limped along
      without it all this time.  Also we have other important work to do.
      
DT-2. Should the data typing mechanism be a separate paper in the WD-xml
 series rather than part of XML-lang?
 Pro: Keep XML-lang simple.  SGML (& maybe HTML) can use it too.
 Con: The usefulness of XML-lang may be impaired if it doesn't have
      the typing guaranteed to be built-in.

DT-3. Should the data typing be a universal/extensible regexp-based thing,
 (as proposed by Gavin Nicol and others) rather than a simple subset of
 of the SQL types as proposed in [1]?
 Pro: extensibility is good - the usages of SGML and XML are unpredictable;
      SQL types were designed for boring COBOL applications.
 Con: we already have extensibility with SGML extended facilities lextypes;
      the SQL types are proven in commercial practice, and are presented
      at the right level for the people who build real applications.

DT-4. Should data typing be provided for attribute values, not just
 content as proposed in [1]?
 Pro: the minimal typing provided by SGML is for attributes; they are
      typically a good place to put atomic values any way.
 Con: for element content, you can do it with just one or two typing
      attributes - if you want to do attributes, the mapping machinery
      gets bigger and more complicated.  Once again, less is more - if
      we have it for elements, do we really need it for attributes?

DT-5. Should data typing for the content of elements be applicable to
 mixed content as proposed in [2], as well as pure character data content
 as proposed in [1]?
 Pro: There's no real architectural problem with doing mixed content, you
      just pretend the child elements aren't there.
 Con: With mixed content you can get into whitespace problems; also, it
      "feels like" the data typing should apply to atomic items, and
      mixed content doesn't "feel" atomic.

DT-6. Should the primary attribute name be XML-TYPE as proposed in [2]
 rather than XML-SQLTYPE as proposed in [1]?
 Pro: Shorter is better; having all these attributes with SQL in front
      of them makes them much less readable.
 Con: These are not pulled out of the air, but rely heavily on SQL; it
      may be desirable to have other typing mechanisms introduced
      later; they would mostly be predeclared in internal subsets; 
      terseness is not supposed to be a big deal per our design goals.

DT-7. Should the typing proposal include data value ranges as proposed
 in [1] (omitted in [2])?
 Pro: Databases use them; this will make it pretty easy to construct an
      input/authoring system that will produce examples that are loadable
      into a database with somewhat higher confidence.
 Con: This really amounts to validation rather than typing, which is
      another domain; the approach in [1] is violently incompatible with
      SQL, which does it with real SQL queries, that can also be 
      extended to involve the values of other queries - in fact, anything
      that an SQL query can do.

DT-8. Should the CHAR datatype use both SIZE and MAXSIZE parameters as
 in [2] (in [1], it only had SIZE and was conceived of as a fixed-size
 field)?
 Pro/Con: I don't understand [2] in this regard - I was following SQL
          in assuming that CHAR fields are fixed-size and thus need only 
          one parm.

DT-9. Should the numeric data types DECIMAL, INTEGER, and FLOAT use
 parameters designed to control the maximum datum size as in [2]'s
 XML-DIGITS, XML-DIGITS-R (both for DECIMAL and INTEGER) and 
 XML-BITS (for FLOAT) ([1] provides only SCALE, for DECIMAL, and
 nothing else).
 Question: what is DIGITS-R, anyhow?  It doesn't show up in the
           example and is not otherwise explained.
 Pro: Better control over the storage requirements, and another
      useful validation step.
 Con: More machinery - perhaps beyond XML's complexity cost/benefit ratio.

DT-10. For the FLOAT datatype, should we prescribe that these internally
 correspond to IEEE floats (as proposed by someone I forget who)?
 Pro: This would make comparison and sorting deterministic, and ensure
      that the string representation corresponds to a particular bit
      pattern.
 Con: Overspecification - comparison/sorting of the strings is
      deterministic anyhow, also presumably of the underlying bit
      patterns, anybody who compares a string against a bit
      pattern has rocks in their head anyhow.

 - Tim

Received on Wednesday, 21 May 1997 07:01:17 UTC