- From: Drew McDermott <drew.mcdermott@yale.edu>
- Date: Thu, 17 May 2001 11:14:45 -0400 (EDT)
- To: www-rdf-logic@w3.org
[jonathan borden]

Drew McDermott wrote:
> ...The problem is that DAML has inherited from the SGML/XML
> tradition this vagueness about exactly what the leaves of the tree are
> in a marked-up document.

err... http://www.w3.org/TR/REC-xml provides a set of 89 EBNF
productions that precisely define the XML abstract syntax on top of a
UNICODE character stream. SGML has had a tradition of precise
specification regarding every aspect of the trees it describes (it's
called Groves): http://www.prescod.net/groves/shorttut/

You also mention the XML Schema datatype framework
http://www.w3.org/TR/xmlschema-2 which has an elaborate set of
datatypes. Then there's RDF, which appears to do things in yet another
way.

The main problem I find in all of this is an ambivalence about whether
the goal is to characterize what expressions *are* or what they *refer
to*. In a programming language, we often find expressions like

   let integer x = 5 in x + 1

What is the type of the second occurrence of x? Integer, of course.

Now if we look at the XML Schema datatypes report, section 3.3 or
thereabouts, we find the datatypes that are concerned with things that
look like names (or tokens or identifiers). Section 3.3.3
characterizes the type Language. An example of an occurrence of a
datum of this type is "en-gb", meaning "English as spoken in Great
Britain." This looks a lot like the case of "x" above; en-gb is an
identifier of type Language, and it refers to a particular dialect.

But now look at section 3.3.6, where we have the datatype Name, with
subtypes ID, IDREF, etc. Here the entire focus is on what names look
like in XML documents. The definitions ultimately point back to the
productions Jonathan mentioned from the XML 1.0 spec. The technical
spec in the XML Schema Datatypes report summarizes thus:

   The value space of ID is the set of all strings that match the
   NCName production in [Namespaces in XML].
   The lexical space of ID is the set of all strings that match the
   NCName production in [Namespaces in XML].

"Value space" and "lexical space" are technical terms introduced in
the XML Schema Datatypes report, best defined by example: The value
space of Integer is the set of all integers; the lexical space is the
set of all nonempty strings of decimal digits preceded by an optional
sign. (This is my simplified definition.)

This works great for Integer, but for Name and its derived types I
think there is a big bug here. To refresh our memories, the ID and
IDREF types are used in the following way: I say

   <sometag ID="important"> ... </sometag>

in one place and

   <othertag IDREF="#important"/>

in another. The idea is of course to allow the second element to
refer to the first. Hence the value space of IDs ought to be
... elements, no? Just as "x" denotes an integer in my little
program, "important" denotes that first element.

Of course, this is the way IDs are actually used by the XML community.
But the spec says something else entirely. So it's hard to sort out
the semantics in a rigorous way. I think that's why the DAML+OIL
report seems fuzzy on this issue, and why everyone seems to read it
slightly differently.

Here's another interesting confusion: At the beginning of section 3.2
of the XML Schema Datatypes report, it says

   The primitive datatypes defined by this specification are
   described below. For each datatype, the value space and lexical
   space are defined....

The only exception is the very first datatype in the list: String! It
has no lexical space. The reason is that the lexical spaces of all
other datatypes are strings of characters, and the designers of XML
did not want to require any kind of string delimiter. So it's
impossible to tell out of context whether the occurrence of t-r-u-e in

   <tag ...>true</tag>

or

   <tag someattr="true">

is a Boolean (true vs. false) or a String ("true" vs "grue"). In
other words, the lexical space of String is everything and nothing.
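To make the ambiguity concrete, here is a minimal Python sketch. It is not taken from the XML Schema Datatypes report: the DATATYPES table and both function names are hypothetical, and the lexical spaces are simplified assumptions (the integer pattern follows the simplified definition above). It shows that a lexical form only yields a value once a datatype has been chosen, and that "true" on its own is compatible with more than one datatype.

```python
import re

# Hypothetical mini-registry: datatype name -> (lexical-space pattern,
# lexical-to-value mapping). These are simplified assumptions, not the
# definitions from the XML Schema Datatypes report.
DATATYPES = {
    "integer": (re.compile(r"[+-]?[0-9]+"), int),
    "boolean": (re.compile(r"true|false|1|0"), lambda s: s in ("true", "1")),
    "string": (re.compile(r".*", re.DOTALL), str),
}


def candidate_types(lexical):
    """Return every datatype whose lexical space contains the string."""
    return [
        name
        for name, (pattern, _) in DATATYPES.items()
        if pattern.fullmatch(lexical)
    ]


def value_of(lexical, datatype):
    """Map a lexical form to a value, given an explicitly chosen datatype."""
    pattern, to_value = DATATYPES[datatype]
    if not pattern.fullmatch(lexical):
        raise ValueError(f"{lexical!r} is not in the lexical space of {datatype}")
    return to_value(lexical)


# Two lexical forms, one value: "+5" and "5" both denote the integer 5.
print(value_of("+5", "integer") == value_of("5", "integer"))  # True

# Out of context, "true" fits both boolean and string.
print(candidate_types("true"))  # ['boolean', 'string']

# Only a declared type (e.g. from xsi:type or a schema) settles the value.
print(repr(value_of("true", "boolean")))  # True
print(repr(value_of("true", "string")))   # 'true'
```

In this sketch the caller must always pass the datatype to value_of; nothing in the string itself selects it, which is the point at issue.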
Of course, we can disambiguate by writing

   <tag xsi:type="string">true</tag>

or by stating in the schema for a particular application how <tag> is
to be interpreted. This seems a lot clumsier to me than writing

   <tag>"true"</tag>

My claim was that a lot of these practices descend from SGML/HTML
tradition; I still think that's true.

                                             -- Drew McDermott
Received on Thursday, 17 May 2001 11:14:46 UTC