Re: DAML ObjectProp vs DatatypeProp from Drew McDermott on 2001-05-17 (www-rdf-logic@w3.org from May 2001)

From: Drew McDermott <drew.mcdermott@yale.edu>
Date: Thu, 17 May 2001 11:14:45 -0400 (EDT)
To: www-rdf-logic@w3.org
Message-Id: <200105171514.LAA26298@pantheon-po01.its.yale.edu>
   [jonathan borden]

   Drew McDermott wrote:

   > ...The problem is that DAML has inherited from the SGML/XML
   > tradition this vagueness about exactly what the leaves of the tree are
   > in a marked-up document.

   err... http://www.w3.org/TR/REC-xml provides a set of 89 EBNF productions
   that precisely define the XML abstract syntax on top of a UNICODE character
   stream.

   SGML has had a tradition of precise specification regarding every aspect of
   the trees it describes (it's called Groves):
   http://www.prescod.net/groves/shorttut/

You also mention the XML Schema datatype framework
http://www.w3.org/TR/xmlschema-2 
which has an elaborate set of datatypes.

Then there's RDF, which appears to do things in yet another way.

The main problem I find in all of this is an ambivalence about whether
the goal is to characterize what expressions *are* or what they *refer
to*.  In a programming language, we often find expressions like

let integer x = 5
in
    x + 1

What is the type of the second occurrence of x?  Integer, of course.

Now if we look at the XML Schema datatypes report, section 3.3 or
thereabouts, we find the datatypes that are concerned with things that
look like names (or tokens or identifiers).  Section 3.3.3
characterizes the type Language.  An example of an occurrence of a
datum of this type is "en-gb", meaning "English as spoken in Great
Britain."  This looks a lot like the case of "x" above; en-gb is an
identifier of type Language, and it refers to a particular dialect.
But now look at section 3.3.6, where we have the datatype Name, with
subtypes ID, IDREF, etc.  Here the entire focus is on what names look
like in XML documents.  The definitions ultimately point back to the
productions Jonathan mentioned from the XML 1.0 spec.  The technical
spec in the XML Schema Datatypes report summarizes thus:

  The value space of ID is the set of all strings that match the
  NCName production in [Namespaces in XML]. The lexical space of ID is
  the set of all strings that match the NCName production in
  [Namespaces in XML]. 

"Value space" and "lexical space" are technical terms introduced in
the XML Schema Datatypes report, best defined by example:  The value
space of Integer is the set of all integers; the lexical space is the
set of all nonempty strings of decimal digits preceded by an optional
sign.  (This is my simplified definition.)  This works great for
Integer, but for Name and its derived types I think there is a big bug
here.  To refresh our memories, the ID and IDREF types are used in the
following way: I say
   <sometag ID="important"> ... </sometag>
in one place and
   <othertag IDREF="#important"/>
in another.  The idea is of course to allow the second element to
refer to the first.  Hence the value space of IDs ought to be
... elements, no?  Just as "x" denotes an integer in my little
program, "important" denotes that first element.
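
The two readings can be put side by side in a minimal Python sketch.
It assumes the simplified lexical space for Integer given above, and
it treats the attributes literally named ID and IDREF in the example
as the identifying ones.  The names integer_value and resolve_id are
hypothetical, and stripping the leading "#" follows the example above
rather than any rule in the specs.

```python
import re
import xml.etree.ElementTree as ET

# Simplified lexical space of Integer: optional sign, then decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def integer_value(lexical: str) -> int:
    """Map a lexical form (a string) to a member of the value space (an int)."""
    if not INTEGER_LEXICAL.fullmatch(lexical):
        raise ValueError(f"not in the lexical space of Integer: {lexical!r}")
    return int(lexical)


def resolve_id(root: ET.Element, name: str, id_attr: str = "ID"):
    """Return the element whose ID attribute equals `name`, or None."""
    for elem in root.iter():
        if elem.get(id_attr) == name:
            return elem
    return None


doc = ET.fromstring(
    '<doc><sometag ID="important">text</sometag>'
    '<othertag IDREF="#important"/></doc>'
)

ref = doc.find("othertag").get("IDREF")    # a string: what the spec describes
target = resolve_id(doc, ref.lstrip("#"))  # an element: what the ID denotes
```

Under the spec's wording the value of the ID is just the string held
in ref; under the reading argued for here it would be the element
held in target.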

Of course, this is the way IDs are actually used by the XML
community.  But the spec says something else entirely.  So it's hard
to sort out the semantics in a rigorous way.  I think that's why the
DAML+OIL report seems fuzzy on this issue, and why everyone seems to
read it slightly differently. 


Here's another interesting confusion:

At the beginning of section 3.2 of the XML Schema Datatype report, it says

   The primitive datatypes defined by this specification are described
   below. For each datatype, the value space and lexical space are
   defined....

The only exception is the very first datatype in the list: String!  It
has no lexical space.  The reason is that the lexical spaces of all
other datatypes are strings of characters, and the designers of XML
did not want to require any kind of string delimiter.  
So it's impossible to tell out of context whether the occurrence of
t-r-u-e in

     <tag ...>true</tag>      or     <tag someattr="true">

is a Boolean (true vs. false) or a String ("true" vs "grue").  In
other words, the lexical space of String is everything and nothing.
Of course, we can disambiguate by writing 
     <tag xsi:type="string">true</tag>
or by stating in the schema for a particular application how <tag> is
to be interpreted.  This seems a lot clumsier to me than writing
     <tag>"true"</tag>
My claim was that a lot of these practices descend from SGML/HTML
tradition; I still think that's true.
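
A small Python sketch of the ambiguity, under stated assumptions: the
boolean lexical space is reduced to "true" and "false", and the
functions interpret and interpret_quoted are hypothetical names, not
part of any XML library.  The first needs the type supplied from
outside (by xsi:type or a schema); the second shows the delimiter
alternative, where the text alone decides.

```python
def interpret(lexical: str, declared_type: str):
    """Interpret undelimited content; the type must come from context."""
    if declared_type == "boolean":
        if lexical not in ("true", "false"):
            raise ValueError(f"not a boolean lexical form: {lexical!r}")
        return lexical == "true"
    if declared_type == "string":
        return lexical  # every character sequence is acceptable
    raise ValueError(f"unsupported type: {declared_type!r}")


def interpret_quoted(lexical: str):
    """Hypothetical alternative: strings carry their own delimiters."""
    if len(lexical) >= 2 and lexical[0] == '"' and lexical[-1] == '"':
        return lexical[1:-1]
    if lexical in ("true", "false"):
        return lexical == "true"
    raise ValueError(f"cannot interpret: {lexical!r}")


as_boolean = interpret("true", "boolean")   # the Boolean value True
as_string = interpret("true", "string")     # the four-character string
quoted = interpret_quoted('"true"')         # string, no context needed
bare = interpret_quoted("true")             # Boolean, no context needed
```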

                                             -- Drew McDermott
Received on Thursday, 17 May 2001 11:14:46 UTC