Re: DAML ObjectProp vs DatatypeProp from Jonathan Borden on 2001-05-17 (www-rdf-logic@w3.org from May 2001)

From: Jonathan Borden <jborden@mediaone.net>
Date: Thu, 17 May 2001 13:00:13 -0400
To: "Drew McDermott" <drew.mcdermott@yale.edu>, <www-rdf-logic@w3.org>
Message-ID: <135601c0def2$d77b2e40$0a2e249b@nemc.org>
Drew McDermott wrote:
>
>    [jonathan borden]
>
>    Drew McDermott wrote:
>
>    > ...The problem is that DAML has inherited from the SGML/XML
>    > tradition this vagueness about exactly what the leaves of the tree
>    > are in a marked-up document.
>
>    err... http://www.w3.org/TR/REC-xml provides a set of 89 EBNF
>    productions that precisely define the XML abstract syntax on top of
>    a UNICODE character stream.
>
>    SGML has had a tradition of precise specification regarding every
>    aspect of the trees it describes (it's called Groves):
>    http://www.prescod.net/groves/shorttut/
>
> You also mention the XML Schema datatype framework
> http://www.w3.org/TR/xmlschema-2
> which has an elaborate set of datatypes.
>
> Then there's RDF, which appears to do things in yet another way.

Oh yes, I agree with this problem, and have voiced this opinion. My sole
objection is that _XML_ precisely specifies what it specifies, using EBNF
productions. XML is a character-based specification, of course, and does not
itself specify a mapping of character data to 'binary' data, but XML 1.0
makes no claim to do so. XML 1.0 is a self-contained specification that
defines an abstract syntax for XML as well as a DTD (schema) constraint
mechanism. XML 1.0 has its own notion of types (e.g. DTD is short for
"Document Type Definition" and an XML 1.0 element type is its name).

Other specifications, e.g. RDF, XML Schema etc. define their own type
mechanisms and in different ways. In the current situation, for the reasons
you mention as well as others, there is no coherent notion of types.

I have suggested that we might best use predicate calculus and set
membership to define a schema-independent (i.e. XML Schema, RDF, TREX and
DTD independent) notion of types. I am not the first to suggest datatyping
in this manner, but I am suggesting it in the context of the current
extended family of specifications; see http://www.rddl.org/SchemaAlgebra
Note that this was picked up by xmlhack as part of the "What do ontologists
want?" thread: http://www.xmlhack.com/read.php?item=1220

The basis for this is briefly that a document can be 'typed' in relation to
a schema by whether the document is 'valid' with respect to the schema.
Nodes within the document can similarly be typed with respect to a class on
the basis of membership in the instance set of the class.

Membership in the instance set of a schema or class is defined by the schema
and/or class format. For example, taking the XML Schema 'lexical space', one
can define membership in a particular class by whether a string matches a
regular expression or EBNF production.
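
A minimal sketch of membership by regular expression, in Python. The three
patterns are simplified stand-ins, not the productions from any
specification, and the names LEXICAL_SPACES, is_member and classes_of are
hypothetical:

```python
import re

# Hypothetical, simplified lexical spaces written as regular expressions.
# They are illustrations only, not the normative definitions.
LEXICAL_SPACES = {
    "integer": re.compile(r"[+-]?[0-9]+"),
    "boolean": re.compile(r"true|false|1|0"),
    "language": re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*"),
}


def is_member(class_name: str, text: str) -> bool:
    """True if the whole string matches the pattern for the named class."""
    return LEXICAL_SPACES[class_name].fullmatch(text) is not None


def classes_of(text: str) -> set:
    """Every class whose instance set contains the given string."""
    return {name for name in LEXICAL_SPACES if is_member(name, text)}


if __name__ == "__main__":
    for sample in ("en-gb", "-42", "true"):
        print(sample, sorted(classes_of(sample)))
```

Under these toy patterns the string "true" belongs to more than one instance
set, so membership alone does not pick out a single type for it.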

>
> Now if we look at the XML Schema datatypes report, section 3.3 or
> thereabouts, we find the datatypes that are concerned with things that
> look like names (or tokens or identifiers).  Section 3.3.3
> characterizes the type Language.  An example of an occurrence of a
> datum of this type is "en-gb", meaning "English as spoken in Great
> Britain."  This looks a lot like the case of "x" above; en-gb is an
> identifier of type Language, and it refers to a particular dialect.
> But now look at section 3.3.6, where we have the datatype Name, with
> subtypes ID, IDREF, etc.  Here the entire focus is on what names look
> like in XML documents.  The definitions ultimately point back to the
> productions Jonathan mentioned from the XML 1.0 spec.  The technical
> spec in the XML Schema Datatypes report summarizes thus:
>
>   The value space of ID is the set of all strings that match the
>   NCName production in [Namespaces in XML]. The lexical space of ID is
>   the set of all strings that match the NCName production in
>   [Namespaces in XML].
>
> "Value space" and "lexical space" are technical terms introduced in
> the XML Schema Datatypes report, best defined by example:  The value
> space of Integer is the set of all integers; the lexical space is the
> set of all nonempty strings of decimal digits preceded by an optional
> sign.  (This is my simplified definition.)

This is in exact agreement with my definitions.
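
As a small illustration of the two terms, here is a Python sketch that
follows the simplified Integer definition quoted above (digits with an
optional sign). The names INTEGER_LEXICAL and integer_value are
hypothetical, and the pattern is the simplification, not the normative one:

```python
import re

# Lexical space (simplified, as in the quoted definition): a nonempty
# string of decimal digits preceded by an optional sign.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def integer_value(lexical: str) -> int:
    """Map a string in the lexical space to its point in the value space."""
    if INTEGER_LEXICAL.fullmatch(lexical) is None:
        raise ValueError(f"{lexical!r} is not in the lexical space of Integer")
    return int(lexical)


if __name__ == "__main__":
    # Several lexical forms denote the same value.
    print(integer_value("7"), integer_value("+7"), integer_value("007"))
```

The mapping is many-to-one: distinct strings in the lexical space can denote
the same member of the value space.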

> This works great for
> Integer, but for Name and its derived types I think there is a big bug
> here.  To refresh our memories, the ID and IDREF types are used in the
> following way: I say
>    <sometag ID="important"> ... </sometag>
> in one place and
>    <othertag IDREF="#important"/>
> in another.  The idea is of course to allow the second element to
> refer to the first.  Hence the value space of IDs ought to be
> ... elements, no?  Just as "x" denotes an integer in my little
> program, "important" denotes that first element.

I am not going to defend this distinction made by XML Schema datatypes.

>
> Of course, this is the way IDs are actually used by the XML
> community.  But the spec says something else entirely.  So it's hard
> to sort out the semantics in a rigorous way.  I think that's why the
> DAML+OIL report seems fuzzy on this issue, and why everyone seems to
> read it slightly differently.

Again, many people in the XML community consider the definition of ID and
IDREF to derive from SGML via XML 1.0. There is no agreement that XML
Schema datatypes supersedes these particular definitions. On the other
hand, if you are at all interested in things like xsd:unsignedShortInteger,
XML itself has nothing to say in this regard.

>
> Here's another interesting confusion:
>
> At the beginning of section 3.2 of the XML Schema Datatype report, it says
>
>    The primitive datatypes defined by this specification are described
>    below. For each datatype, the value space and lexical space are
>    defined....
>
> The only exception is the very first datatype in the list: String!  It
> has no lexical space.  The reason is that the lexical spaces of all
> other datatypes are strings of characters, and the designers of XML
> did not want to require any kind of string delimiter.

??? Again, XML 1.0 productions such as "CharData" define the 'lexical space'
of String (or, to put it another way, perhaps XML Schema assumes that the
'lexical space' of this datatype is as defined in XML 1.0).

> So it's impossible to tell out of context whether the occurrence of
> t-r-u-e in
>
>      <tag ...>true</tag>      or     <tag someattr="true">
>
> is a Boolean (true vs. false) or a String ("true" vs "grue").

This is correct. Nothing in XML 1.0 + Namespaces allows you to make any
claim on how "true" is to be interpreted. On the other hand you _can_
constrain a vocabulary based on element names: e.g.

<logic:true/>
<logic:false/>

... and you can constrain the range of values an attribute might have:

<!ATTLIST foo:bar logic:value (true|false) #REQUIRED>

or a piece of software _which uses_ XML 1.0 (such as an XML Schema
validator/processor) can 'attach' a datatype to an attribute value node:

<foo:bar logic:value="true"/>

similarly a QName can be used:

<foo:bar logic:value="logic:true"/>


> My claim was that a lot of these practices descend from SGML/HTML
> tradition; I still think that's true.
>

The SGML/XML tradition is _firmly_ one of representing textual data,
character encodings etc. You will find very little ... or nothing ... assigning
_meaning_ to character tokens such as "true" and "false". The SGML/XML
tradition is to make a sharp and unequivocal distinction on these matters,
_specifically_ so as not to cause the sorts of problems you identify.

Understanding this, and if you are prepared to assign such semantics
yourself, you might find it a pleasure to work with (it won't step on your
toes in this regard either).
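
A minimal sketch of software layered on XML 1.0 'attaching' a datatype to the
logic:value attribute from the examples above, in Python. The namespace URIs,
the ATTRIBUTE_TYPES table and the function names are all assumptions made
for illustration; the parser itself only ever hands back the string:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the original examples do not declare any.
FOO_NS = "http://example.org/foo"
LOGIC_NS = "http://example.org/logic"

DOC = (
    f'<foo:bar xmlns:foo="{FOO_NS}" xmlns:logic="{LOGIC_NS}" '
    'logic:value="true"/>'
)

# Application-level decision: which datatype goes with which attribute.
# ElementTree names a namespaced attribute as "{uri}local".
ATTRIBUTE_TYPES = {f"{{{LOGIC_NS}}}value": "boolean"}


def interpret(raw: str, datatype: str):
    """Assign a meaning to the character data; XML itself assigns none."""
    if datatype == "boolean":
        if raw not in ("true", "false"):
            raise ValueError(f"{raw!r} is not a boolean literal")
        return raw == "true"
    return raw  # default: leave it as a plain string


def typed_attributes(element: ET.Element) -> dict:
    """Return the element's attributes with datatypes attached."""
    return {
        name: interpret(raw, ATTRIBUTE_TYPES.get(name, "string"))
        for name, raw in element.attrib.items()
    }


root = ET.fromstring(DOC)

if __name__ == "__main__":
    print(root.attrib)             # what the XML parser reports
    print(typed_attributes(root))  # what the application layer decides
```

The choice between Boolean and String is made in ATTRIBUTE_TYPES, outside
the document and outside the parser.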

Jonathan Borden
The Open Healthcare Group
http://www.openhealth.org
Received on Thursday, 17 May 2001 13:16:35 UTC