- From: Jonathan Borden <jborden@mediaone.net>
- Date: Thu, 17 May 2001 13:00:13 -0400
- To: "Drew McDermott" <drew.mcdermott@yale.edu>, <www-rdf-logic@w3.org>
Drew McDermott wrote:
>
> [jonathan borden]
> Drew McDermott wrote:
>
> > ...The problem is that DAML has inherited from the SGML/XML
> > tradition this vagueness about exactly what the leaves of the tree are
> > in a marked-up document.
>
> err... http://www.w3.org/TR/REC-xml provides a set of 89 EBNF productions
> that precisely define the XML abstract syntax on top of a UNICODE character
> stream.
>
> SGML has had a tradition of precise specification regarding every aspect of
> the trees it describes (it's called Groves):
> http://www.prescod.net/groves/shorttut/
>
> You also mention the XML Schema datatype framework
> http://www.w3.org/TR/xmlschema-2
> which has an elaborate set of datatypes.
>
> Then there's RDF, which appears to do things in yet another way.

Oh yes, I agree with this problem, and have voiced this opinion. My sole
objection is that _XML_ precisely specifies what it specifies using EBNF
productions. XML is a character-based specification, of course, and does not
itself specify a mapping of character data to 'binary' data, but XML 1.0
doesn't claim to.

XML 1.0 is a self-contained specification that defines an abstract syntax
for XML as well as a DTD (schema) constraint mechanism. XML 1.0 has its own
notion of types (e.g. DTD is short for "Document Type Definition", and an
XML 1.0 element type is its name). Other specifications, e.g. RDF, XML
Schema etc., define their own type mechanisms, each in a different way.

In the current situation, for the reasons you mention as well as others,
there is no coherent notion of types. I have suggested that we might best
use predicate calculus and set membership to define a schema-independent
(i.e. XML Schema and RDF and TREX and DTD independent) notion of types. I am
not the first to suggest datatyping in this manner, but I am suggesting it
in the context of the current extended family of specifications.
See http://www.rddl.org/SchemaAlgebra -- note that this was picked up by
xmlhack as part of the "What do ontologists want?" thread:
http://www.xmlhack.com/read.php?item=1220

Briefly, the basis for this is that a document can be 'typed' in relation to
a schema by whether the document is 'valid' with respect to the schema.
Nodes within the document can similarly be typed with respect to a class on
the basis of membership in the instance set of the class. Membership in the
instance set of a schema or class is defined by the schema and/or class
format. For example, considering the XML Schema 'lexical space', one can
define membership in a particular class based on matching a regular
expression or EBNF production.

>
> Now if we look at the XML Schema datatypes report, section 3.3 or
> thereabouts, we find the datatypes that are concerned with things that
> looks like names (or tokens or identifiers). Section 3.3.3
> characterizes the type Language. An example of an occurrence of a
> datum of this type is "en-gb", meaning "English as spoken in Great
> Britain." This looks a lot like the case of "x" above; en-gb is an
> identifier of type Language, and it refers to a particular dialect.
> But now look at section 3.3.6, where we have the datatype Name, with
> subtypes ID, IDREF, etc. Here the entire focus is on what names look
> like in XML documents. The definitions ultimately point back to the
> productions Jonathan mentioned from the XML 1.0 spec. The technical
> spec in the XML Schema Datatypes report summarizes thus:
>
>    The value space of ID is the set of all strings that match the
>    NCName production in [Namespaces in XML]. The lexical space of ID is
>    the set of all strings that match the NCName production in
>    [Namespaces in XML].
>
> "Value space" and "lexical space" are technical terms introduced in
> the XML Schema Datatypes report, best defined by example: The value
> space of Integer is the set of all integers; the lexical space is the
> set of all nonempty strings of decimal digits preceded by an optional
> sign. (This is my simplified definition.)

This is in exact agreement with my definitions.

> This works great for
> Integer, but for Name and its derived types I think there is a big bug
> here. To refresh our memories, the ID and IDREF types are used in the
> following way: I say
>    <sometag ID="important"> ... </sometag>
> in one place and
>    <othertag IDREF="#important"/>
> in another. The idea is of course to allow the second element to
> refer to the first. Hence the value space of IDs ought to be
> ... elements, no? Just as "x" denotes an integer in my little
> program, "important" denotes that first element.

I am not going to defend this distinction made by XML Schema datatypes.

>
> Of course, this is the way IDs are actually used by the XML
> community. But the spec says something else entirely. So it's hard
> to sort out the semantics in a rigorous. I think that's why the
> DAML+OIL report seems fuzzy on this issue, and why everyone seems to
> read it slightly differently.

Again, many, many people in the XML community consider the definition of ID
and IDREF to derive from SGML via XML 1.0. There is no agreement that XML
Schema datatypes supersedes these particular definitions. On the other hand,
if you are at all interested in xsd:unsignedShortInteger's, XML itself has
nothing to say in this regard.

>
> Here's another interesting confusion:
>
> At the beginning of section 3.2 of the XML Schema Datatype report, it says
>
>    The primitive datatypes defined by this specification are described
>    below. For each datatype, the value space and lexical space are
>    defined....
>
> The only exception is the very first datatype in the list: String! It
> has no lexical space.
> The reason is that the lexical spaces of all
> other datatypes are strings of characters, and the designers of XML
> did not want to require any kind of string delimiter.

??? Again, in XML 1.0 productions such as "CharData" the 'lexical space' of
String is defined (or, to put it another way, perhaps XML Schema assumes
that the 'lexical space' of this datatype is as defined in XML 1.0).

> So it's impossible to tell out of context whether the occurrence of
> t-r-u-e in
>
>    <tag ...>true</tag> or <tag someattr="true">
>
> is a Boolean (true vs. false) or a String ("true" vs "grue").

This is correct. Nothing in XML 1.0 + Namespaces allows you to make any
claim on how "true" is to be interpreted. On the other hand, you _can_
constrain a vocabulary based on element names, e.g.

   <logic:true/>
   <logic:false/>
   ...

and you can constrain the range of values an attribute might have:

   <!ATTLIST foo:bar logic:value (true|false) #REQUIRED>

or a piece of software _which uses_ XML 1.0 (such as an XML Schema
validator/processor) can 'attach' a datatype to an attribute value node:

   <foo:bar logic:value="true"/>

similarly a QName can be used:

   <foo:bar logic:value="logic:true"/>

> My claim was that a lot of these practices descend from SGML/HTML
> tradition; I still think that's true.
>

The SGML/XML tradition is _firmly_ to represent textual data, character
encodings etc. You will find very little ... or nothing ... assigning
_meaning_ to character tokens such as "true" and "false". The SGML/XML
tradition is to make a sharp and unequivocal distinction on these matters,
_specifically_ so as not to cause the sorts of problems you identify.
Understanding this, and if you are prepared to assign such semantics
yourself, you might find it a pleasure to work with (it won't step on your
toes in this regard either).

Jonathan Borden
The Open Healthcare Group
http://www.openhealth.org
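
To make the membership-based notion of typing discussed in this message a little more concrete, here is a minimal Python sketch. It treats each class as the set of strings matching a regular expression, so "typing" a character token is just a membership test. The names LEXICAL_SPACES and types_of, and the deliberately simplified patterns, are assumptions made for illustration only; they are not taken from XML Schema, RDF, or the SchemaAlgebra page.

```python
import re

# Hypothetical, simplified lexical spaces, one regular expression per class.
# These patterns are illustrative assumptions, not the definitions in any spec.
LEXICAL_SPACES = {
    "integer": re.compile(r"[+-]?[0-9]+"),   # optional sign, then decimal digits
    "boolean": re.compile(r"true|false"),    # only the two literal tokens
    "string": re.compile(r".*", re.DOTALL),  # any sequence of characters
}


def types_of(text):
    """Return the sorted names of every class whose instance set contains text."""
    return sorted(
        name
        for name, pattern in LEXICAL_SPACES.items()
        if pattern.fullmatch(text)
    )


print(types_of("true"))  # ['boolean', 'string']
print(types_of("-42"))   # ['integer', 'string']
print(types_of("grue"))  # ['string']
```

Under these assumptions the token "true" belongs to more than one instance set, which is the ambiguity raised above: the characters alone do not pick a datatype, so something outside XML 1.0 (a schema, a DTD constraint, or the application) has to.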
Received on Thursday, 17 May 2001 13:16:35 UTC