Re: Proposed restatement of syntax-based interoperability principle ( was RE: Action item on syntax-based interoperability) from noah_mendelsohn@us.ibm.com on 2003-10-24 (www-tag@w3.org from October 2003)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 24 Oct 2003 16:32:56 -0400
To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>
Cc: www-tag@w3.org
Message-ID: <OFD6AD1424.27A92C22-ON85256DC9.0062F5E6@lotus.com>
Reading this thread, I'd be curious to hear from some of the original 
authors of XML:  to what degree did you believe you were establishing only 
a syntax, and to what degree a model plus a syntax?  (I don't necessarily 
mean the Infoset in particular; I mean a model implicit in the XML Rec.)

Here's what I mean.   If you're only establishing syntax, then you are (I 
think) merely indicating that the following forms are legal:

        <e></e>
        <e/>
        <e a="1"></e>
        <e a='1'></e>
        <e a="1"/>
        <e a='1'/>

while specifying that the following is not:


        <a></b>
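
For anyone who wants to check the purely syntactic claim, here is a minimal 
sketch in Python. It assumes the standard library's non-validating parser 
(xml.etree.ElementTree) is an acceptable stand-in for "legal per the XML 
Rec"; the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The six forms listed above as legal, plus the one listed as illegal.
LEGAL_FORMS = [
    "<e></e>",
    "<e/>",
    '<e a="1"></e>',
    "<e a='1'></e>",
    '<e a="1"/>',
    "<e a='1'/>",
]
ILLEGAL_FORM = "<a></b>"


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


for form in LEGAL_FORMS + [ILLEGAL_FORM]:
    print(f"{form!r}: well-formed = {is_well_formed(form)}")
```

Note that this only answers the yes/no syntax question; it says nothing 
yet about which of the legal forms should be treated as "the same".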

Independent of the Infoset or XPath data models, I think the XML 
Recommendation supports the inference that the following groups are to be 
distinguished in the legal forms above:

Group 1 (Data model:  an empty element named "e"):
        <e></e>
        <e/>

Group 2 (Data model: an empty element "e" with attribute "a" of value 
"1"):
        <e a="1"></e>
        <e a='1'></e>
        <e a="1"/>
        <e a='1'/>


Thus, the forms in group 1 make up one equivalence class and those in 
group 2 make up another, and the two classes are distinct.  What licenses 
this interpretation?  Well, I claim that it's because there is a data 
model implied by XML.  That model says:  there is a document at the root, 
with a single root element associated with that document.  Each element 
has zero or more attributes, with the choice of single or double quotes 
around attribute values being uninteresting for many purposes, and so on. 
So, I claim that at least some sort of model is implied in the XML 
Recommendation itself. 
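
As a rough illustration of the equivalence classes, the sketch below 
reduces each form to the small amount of information such an implied model 
would keep (element name, attributes, text, child count) and groups the 
forms by that. The key layout and the names model_key and 
equivalence_classes are my own assumptions, not anything defined by the 
XML Rec or the Infoset.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

FORMS = [
    "<e></e>",
    "<e/>",
    '<e a="1"></e>',
    "<e a='1'></e>",
    '<e a="1"/>',
    "<e a='1'/>",
]


def model_key(text: str) -> tuple:
    """Reduce a document to an assumed, very small 'model' of its root."""
    root = ET.fromstring(text)
    attributes = tuple(sorted(root.attrib.items()))
    # The parser reports no text for both <e></e> and <e/>; normalize to "".
    return (root.tag, attributes, root.text or "", len(root))


def equivalence_classes(forms: list) -> dict:
    """Group serialized forms that share the same model key."""
    classes = defaultdict(list)
    for form in forms:
        classes[model_key(form)].append(form)
    return dict(classes)


for key, members in equivalence_classes(FORMS).items():
    print(key, "<-", members)
```

Quote style and the empty-element shorthand never reach the key, which is 
exactly the sense in which they are not significant in the model.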

A question is:  are we better off on balance having set down that model 
clearly, and layered it separately from the syntax?  I think so.  As 
others have observed, one of the reasons the model is there is that you 
want to talk about what is significant for processing:  for many purposes, 
it's the underlying elements and attributes that are significant in the 
example above, not the choice of single or double quotes.

To make this easy to talk about, we can set down just the information 
that's significant in a model such as the Infoset.  I think this is 
unambiguously a good step to have taken, modulo some discomfort regarding 
the lack of integration between the XML, Namespaces, and Infoset Rec 
documents.  Indeed, I would (oh heresy) have preferred to see the data 
model terminology introduced first, which would then allow you to say in 
the XML Rec:  "an attribute information item is serialized in the form 
a="1" or a='1'".  I also think it's a real mess that we are now up to 
three overlapping data models, plus the one implied by XML itself.  We've 
got XPath 1.0 data model used by XSL 1.0 and the standard c14n's;  Infoset 
used by Schema, SOAP etc.; and the new XQuery model that captures the 
value-space to lexical-space associations, needed to support operations 
such as bounds checks on the types like integer that were introduced with 
Schema.  In principle, it would be nice to have one scalable model, 
rather than three that overlap to such a significant degree, IMO.  Still, 
I think the need for model(s) is compelling.
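
To make the value-space versus lexical-space point concrete, here is a 
small Python sketch. It assumes that the lexical form of a Schema integer 
is an optional sign followed by decimal digits, and it uses Python's int 
as a stand-in for the value space; integer_value and in_bounds are 
hypothetical helper names, not part of any of the specs mentioned.

```python
import re

# Assumed lexical form for an integer: optional sign, then decimal digits.
_INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")


def integer_value(lexical: str) -> int:
    """Map a lexical representation to its value (hypothetical helper)."""
    text = lexical.strip(" \t\r\n")  # collapse surrounding XML whitespace
    if not _INTEGER_LEXICAL.fullmatch(text):
        raise ValueError(f"not an integer lexical form: {lexical!r}")
    return int(text)


def in_bounds(lexical: str, low: int, high: int) -> bool:
    """Bounds check performed on values, not on the characters."""
    return low <= integer_value(lexical) <= high


# Several lexical forms share a single value.
print([integer_value(s) for s in ("1", "01", "+1")])

# Comparing the raw strings gives the wrong answer for a bounds check.
print("10" < "9", integer_value("10") < integer_value("9"))
```

A model that records only the characters cannot support the second kind 
of comparison, which is the gap the newer model is described as filling.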

As others have argued, part of the compelling value of XML is that there 
is a single serialization syntax  that covers most use cases.  Indeed, XML 
1.0 syntax should be used wherever practical.  Just because the data 
models are well specified does not mean we need to standardize or 
encourage use of multiple serial forms.   Such alternate representations 
should be used only when the use cases are compelling, or internally to 
particular optimized implementations.  As I've said in the debates on 
binary XML, I think the bar should be set very, very high in justifying 
the standardization of any such alternate serial forms.

On the other hand, I think it's clear that in memory, for the on-disk 
structures of an XML database, or perhaps even on very slow links or small 
memories, there are good reasons to optimize the representation of XML. In 
defining such representations, it's useful to know that even the XML 
Rec suggests that the differences between a='1' and a="1" may not be 
significant.  The Data Model Recommendation(s) capture that.
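
One way to see this in practice is canonicalization. The sketch below 
assumes Python 3.8 or later, whose xml.etree.ElementTree.canonicalize 
implements C14N 2.0 (not necessarily the same c14n variants discussed 
elsewhere in this thread), and checks that the quoting and empty-element 
differences disappear once each form is canonicalized.

```python
import xml.etree.ElementTree as ET

GROUP_1 = ["<e></e>", "<e/>"]
GROUP_2 = ['<e a="1"></e>', "<e a='1'></e>", '<e a="1"/>', "<e a='1'/>"]


def canonical(text: str) -> str:
    """Return the canonical serialization of an XML string."""
    return ET.canonicalize(text)


# Each group collapses to one canonical string; the two groups stay apart.
group_1_canonical = {canonical(form) for form in GROUP_1}
group_2_canonical = {canonical(form) for form in GROUP_2}

print(group_1_canonical)
print(group_2_canonical)
print(group_1_canonical.isdisjoint(group_2_canonical))
```

An optimized in-memory or on-disk representation is free to discard the 
same distinctions, provided it can still produce an equivalent XML 1.0 
serialization on the way out.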

So, I think that some form of explicit model is important, indeed 
necessary.  We must then avoid the temptation to use the existence of a 
data model as an excuse for a proliferation of non-standard or even 
standardized serial syntaxes.  I think we can avoid those temptations 
while still benefiting from clear documentation of the models that I 
believe have been implicit in XML from day 1 anyway.

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Friday, 24 October 2003 16:35:23 UTC