- From: <noah_mendelsohn@us.ibm.com>
- Date: Fri, 24 Oct 2003 16:32:56 -0400
- To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>
- Cc: www-tag@w3.org
Reading this thread, I'd be curious to hear from some of the original
authors of XML: to what degree did you believe you were only establishing
syntax and to what degree a model + syntax (not necessarily the infoset
in particular; I mean a model implicit in the XML Rec)?
Here's what I mean. If you're only establishing syntax, then you are (I
think) merely indicating that the following forms are legal:
<e></e>
<e/>
<e a="1"></e>
<e a='1'></e>
<e a="1"/>
<e a='1'/>
while specifying that the following is not:
<a></b>
Independent of the Infoset or XPath data models, I think the XML
Recommendation supports the inference that the following groups are to be
distinguished in the legal forms above:
Group 1 (Data model: an empty element named "e"):
<e></e>
<e/>
Group 2 (Data model: an empty element "e" with attribute "a" of value
"1"):
<e a="1"></e>
<e a='1'></e>
<e a="1"/>
<e a='1'/>
Thus, group 1 is part of an equivalence class, as is group 2, but those
are separate classes. What licenses this interpretation? Well, I claim
that it's because there is a data model implied by XML. That model says:
there is a document at the root, with a single root element associated
with that document. Each element has zero or more attributes, with the
choice of single or double quotes on attributes typically being
uninteresting for many purposes, and so on. So, I claim that at least
some sort of model is implied in the XML Recommendation itself.
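To make the grouping above concrete, here is a minimal sketch that parses the six legal forms and buckets them by what the parser reports. It assumes Python's standard xml.etree.ElementTree as a stand-in for "the model implied by XML"; the helper names (model, equivalence_classes, is_well_formed) are hypothetical, and the reduction to name, attributes, text, and children is only one plausible reading of that model, not the Infoset itself.

```python
import xml.etree.ElementTree as ET

# The six legal forms from the message, in the order given.
FORMS = [
    "<e></e>",
    "<e/>",
    '<e a="1"></e>',
    "<e a='1'></e>",
    '<e a="1"/>',
    "<e a='1'/>",
]


def model(xml_text):
    """Reduce a serialized element to (name, attributes, text, children).

    Quote style and empty-element syntax do not survive parsing, so they
    cannot influence the result. (Hypothetical helper, not a standard API.)
    """
    def walk(elem):
        return (
            elem.tag,
            tuple(sorted(elem.attrib.items())),
            elem.text or "",  # <e></e> and <e/> both report no text
            tuple(walk(child) for child in elem),
        )
    return walk(ET.fromstring(xml_text))


def equivalence_classes(forms):
    """Group serialized forms that reduce to the same model."""
    classes = {}
    for form in forms:
        classes.setdefault(model(form), []).append(form)
    return list(classes.values())


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for group in equivalence_classes(FORMS):
        print(group)
    print(is_well_formed("<a></b>"))
```

Under these assumptions the six forms fall into the two groups listed above, and the mismatched <a></b> is rejected before any model is built.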
A question is: are we better off on balance having set down that model
clearly, and layered it separately from the syntax? I think so. As
others have observed, one of the reasons the model is there is because you
want to talk about what is significant for processing: for many purposes,
it's the underlying elements and attributes that are significant in the
example above, not the choice of single or double quotes.
To make this easy to talk about, we can set down just the information
that's significant in a model such as the Infoset. I think this is
unambiguously a good step to have taken, modulo some discomfort regarding
the lack of integration between the XML, Namespaces, and Infoset Rec
documents. Indeed, I would (oh heresy) have preferred to see the data
model terminology introduced first, which would then allow you to say in
the XML Rec: "an attribute information item is serialized in the form
a="1" or a='1'." I also think it's a real mess that we are now up to
three overlapping data models, plus the one implied by XML itself. We've
got XPath 1.0 data model used by XSL 1.0 and the standard c14n's; Infoset
used by Schema, SOAP etc.; and the new XQuery model that captures the
value-space to lexical-space associations, needed to support operations
such as bounds checks on the types like integer that were introduced with
Schema. In principle, it would be nice to have one scalable model,
rather than three that overlap to such a significant degree, IMO. Still,
I think the need for model(s) is compelling.
As others have argued, part of the compelling value of XML is that there
is a single serialization syntax that covers most use cases. Indeed, XML
1.0 syntax should be used wherever practical. Just because the data
models are well specified does not mean we need to standardize or
encourage use of multiple serial forms. Such alternate representations
should be used only when the use cases are compelling, or internally to
particular optimized implementations. As I've said in the debates on
binary XML, I think the bar should be set very, very high in justifying
the standardization of any such alternate serial forms.
On the other hand, I think it's clear that in memory, for the on-disk
structures of an XML database, or perhaps even on very slow links or small
memories, there are good reasons to optimize the representation of XML. In
defining such representations, it's useful to know that even the XML
Rec suggests that the differences between a='1' and a="1" may not be
significant. The Data Model Recommendation(s) capture that.
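One way to see that insignificance in practice is to canonicalize both spellings and compare the results. The sketch below is only an illustration: it assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize implements Canonical XML 2.0 rather than the XPath 1.0 based c14n versions mentioned earlier, and the helper name same_canonical_form is hypothetical.

```python
import xml.etree.ElementTree as ET


def same_canonical_form(left, right):
    """Return True if two serializations canonicalize to identical text.

    Uses the standard library's C14N 2.0 implementation (an assumption of
    this sketch; other canonicalization versions may differ in detail).
    """
    return ET.canonicalize(left) == ET.canonicalize(right)


if __name__ == "__main__":
    # Quote style and empty-element syntax differ only at the syntax level.
    print(same_canonical_form("<e a='1'/>", '<e a="1"></e>'))
    # Presence of an attribute is a difference in the model itself.
    print(same_canonical_form("<e/>", '<e a="1"/>'))
```

An optimized in-memory or on-disk representation could, on this view, discard whatever canonicalization discards, while still round-tripping to standard XML 1.0 syntax at the edges.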
So, I think that some form of explicit model is important, indeed
necessary. We must then avoid the temptation to use the existence of a
data model as an excuse for a proliferation of non-standard or even
standardized serial syntaxes. I think we can avoid those temptations
while still benefiting from clear documentation of the models that I
believe have been implicit in XML from day 1 anyway.
------------------------------------------------------------------
Noah Mendelsohn Voice: 1-617-693-4036
IBM Corporation Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Friday, 24 October 2003 16:35:23 UTC