Re: abstract XML and bytes-on-the-wire interoperability from noah_mendelsohn@us.ibm.com on 2005-04-24 (www-tag@w3.org from April 2005)

From: <noah_mendelsohn@us.ibm.com>
Date: Sun, 24 Apr 2005 08:47:58 -0400
To: Dan Connolly <connolly@w3.org>
Cc: www-tag@w3.org
Message-ID: <OF9FE045A4.73C78B27-ON85256FEC.0051B742-85256FED.004651AB@lotus.com>
Dan Connolly wrote:

> Noah said he could see both sides...

>  [[ NM: I think both views of this are right,
> there is a case to be said that the infoset way is
> architecturally better ...

Yes, because it allows our recommendations to apply to in-memory 
representations which have never been serialized, or to other cases in 
which you want a non-XML serialization of the data. 

> OTOH, Dan is right as
> well and we need to provide [details missing] ]]
>   -- http://www.w3.org/2001/tag/2005/04/19-minutes#item05

> I think the [details missing] was something about
> interoperability at the bytes-on-the-wire level.

Right, that's what I meant.  Wire-level or file-format-level definitions 
are what's needed to achieve actual interoperation.  I believe Mike 
Champion covered this well in his followup note.  Such interoperation is 
critical to the success of the Web and of XML.  That's exactly what's 
"threatened" by the promotion of Binary XML as an alternative 
serialization:  existing implementations that work with XML today will 
fail to interoperate with this new form of XML.

Your discussion of XML Schema seems to miss the simplest sense in which 
XML Schema is Infoset-based:  the instances to be validated are modeled as 
Infosets.  Consider, for example, an XQuery-based system in which it 
might be asserted that certain XML fragments resulting from a query are to 
be schema-valid per some element declaration.  Because the Schema 
Recommendation is Infoset-based, the validation can be performed against 
any in-memory representation that might be convenient.  If the Schema 
Recommendation were written to validate only XML documents, then the 
implementation would either have to actually serialize the result to enable 
validation, or would have to ensure that it produced results that are the 
same as if serialization had been done.  In any case, my reference to 
Infosets was primarily with regard to the data to be validated, and only 
indirectly to the representations of schema documents themselves.
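
As a rough illustration of that point (a sketch only, not anything the 
Recommendation defines):  the Python below builds an element in memory and 
checks it without ever serializing it, then repeats the check after a round 
trip through XML 1.0 bytes.  The helper is_valid_integer_element is 
hypothetical and is a toy stand-in for real schema validation.

```python
import xml.etree.ElementTree as ET


def is_valid_integer_element(elem, expected_tag):
    """Toy stand-in for schema validation (hypothetical helper).

    Accepts an element with the expected name, no child elements, and
    text that is an optionally signed run of ASCII decimal digits.
    Real xsd:integer validation involves considerably more than this.
    """
    if elem.tag != expected_tag or len(elem) != 0:
        return False
    text = (elem.text or "").strip()
    digits = text[1:] if text[:1] in ("+", "-") else text
    return digits.isascii() and digits.isdigit()


# An element built directly in memory, e.g. as a query result.
# It has never existed as angle brackets in a file or on the wire.
in_memory = ET.Element("quantity")
in_memory.text = "42"

# The same information after being serialized and parsed again.
round_tripped = ET.fromstring(ET.tostring(in_memory))

result_in_memory = is_valid_integer_element(in_memory, "quantity")
result_round_tripped = is_valid_integer_element(round_tripped, "quantity")
```

Both checks give the same answer, and the first never needed a serialized 
form at all.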

Since you've also gone into the latter, here are a few additional comments 
on your note:

> I just happened to be looking at how URIs interact
> with XML specifications, and I discovered
> (rediscovered?) that XML Schema has conformance
> clauses at three levels:
>
>   (1) the component level, where even the 
>       infoset representation of a 
>       schema is abstracted away

Yes, but I think it's worth using the terminology of the recommendation, 
and then clarifying a few details.  The schema Recommendation establishes 
the following terminology:

schema:  the information needed (in addition to the instance itself) to 
perform a validation.   Note that a single schema integrates declarations 
for multiple namespaces, and for non-namespaced constructs, in a uniform 
and symmetrical way.   Each validation uses one schema and one instance, 
regardless of the number of namespaces involved. 

component:  the schema is organized into components, mostly for reasons of 
clarity.  Thus, there is a component for each element declaration, each 
type, etc.  Like the schema as a whole, components are abstract.  They 
tell you the information you need to perform a validation, not the form in 
which that information is to be stored or communicated.  As an example, if 
you know the qualified name of an element and the fact that its type is 
xsd:integer (and a few other bits), you can validate the element as an 
integer.  At this level, we don't constrain the manner in which you set 
down or communicate the name and type of the element.

schema document:  an element information item with qualified element name 
<xsd:schema> (using usual namespace bindings).  We further state that in 
the common case where such an element is the root element of an XML 
document, and where that document is "on the web" (has been given a URI as 
opposed to, say, being offered directly through a Java InputStream), the 
media type should be "application/xml".

With that background, I can quibble a bit with the above.  Saying 
"abstracted away" implies that there was in all cases an Infoset from 
which to abstract, but that is not the primary focus of this level of 
conformance.  The primary focus of establishing a level of component-based 
conformance is to deal with the (less common) case where schema 
information has been directly synthesized using some form other than 
schema documents.  Imagine a dynamic API along the lines of 
"createElementDeclaration";  as long as it lets you specify the element 
name, the type, and whatever else the component requires, our 
Recommendation applies.  In such cases, the Infoset has not been 
abstracted away, because it never existed.
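
A minimal sketch of what such a dynamic API might look like, in Python.  
Everything here is an assumption made for illustration:  the 
ElementDeclaration class, create_element_declaration, validate, and the 
example namespace are hypothetical, and the integer check is a simplified 
stand-in for xsd:integer.  The point is only that the component is built 
directly, with no schema document (and hence no Infoset) behind it.

```python
from dataclasses import dataclass
from typing import Callable
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class ElementDeclaration:
    """Hypothetical abstract component: only what validation needs."""
    namespace: str
    local_name: str
    type_check: Callable[[str], bool]


def is_integer(text: str) -> bool:
    """Simplified lexical check standing in for xsd:integer."""
    text = text.strip()
    digits = text[1:] if text[:1] in ("+", "-") else text
    return digits.isascii() and digits.isdigit()


def create_element_declaration(namespace, local_name, type_check):
    """Hypothetical API along the lines of "createElementDeclaration"."""
    return ElementDeclaration(namespace, local_name, type_check)


def validate(elem: ET.Element, decl: ElementDeclaration) -> bool:
    # ElementTree spells a qualified name as "{namespace}local".
    if decl.namespace:
        expected = f"{{{decl.namespace}}}{decl.local_name}"
    else:
        expected = decl.local_name
    return (
        elem.tag == expected
        and len(elem) == 0
        and decl.type_check(elem.text or "")
    )


# The declaration is synthesized in code; no <xsd:schema> was ever parsed.
decl = create_element_declaration("http://example.org/ns", "count", is_integer)

good = ET.Element("{http://example.org/ns}count")
good.text = "-17"
bad = ET.Element("{http://example.org/ns}count")
bad.text = "seventeen"

good_result = validate(good, decl)
bad_result = validate(bad, decl)
```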

> 
>   (2) "conformance to the XML Representation 
>       of Schemas" which is actually at the
>        infoset level

Right.  This is actually about Schema Documents, and these are organized 
by target namespace.  Note that, when inheritance across namespaces is 
involved, much of the information for a given component may be inherited 
from components not overtly declared in the document in hand.  This is 
another sense in which the use of the phrase "abstracted away" is a bit 
misleading.   The component constructed from the markup in a given schema 
document (infoset) may have information well beyond that found in the 
markup.  That inherited information may have come from other schema 
documents, or from synthetic components (e.g. from the API postulated 
above).
 
> plus another that we didn't get into in the
> teleconference:
> 
>   (3) it has an explicit conformance clause for
>       processors that aren't running on some
>       disconnected LAN that has its own DNS root,
>       but have access to the capital-I
>       Internet.

I'd have to check, but I don't think we said it quite that way (I'm in the 
car at the moment and can't easily get to the details).  As I recall, we 
turn that logic upside down relative to your summary and deal first with 
cases where DNS is not an issue at all.  We start with a general 
discussion of the case where there is a schema document infoset, and thus 
a corresponding XML 1.0 serialization as a schema document.  We can deal 
with such documents regardless of whether they have ever been given a URI 
and have in that sense been on the Web at all.  For example, if you built 
a Java-based system and just used ordinary Java filesystem I/O to access 
the XML streams, we would consider those conforming schema documents and 
the recommendation would apply.  Likewise for a relational database that 
stored such schema documents in tables, and named them with 
(non-URI-based) primary keys.  Since processors have discretion to find 
such documents in a processor-specific manner, we don't have to say 
anything about how one processor or another chooses files to use for its 
schemas.

With that layer of conformance in hand, we add the web on top.  We say 
that there is a particular but very important case where the documents 
have been given URIs and are accessible through the mechanisms of the Web. 
In this case we call for use of media type application/xml, and for the 
usual mechanisms of the Web to be used for retrieval.   We call this third 
level "fully conforming", in part to encourage its use.

I'm not sure I see where you are picking up a suggestion to use private 
DNS roots.  There is the case where you have referenced schema documents 
in an instance or schema by URI using schemaLocations.  I don't think we 
particularly encourage the use of private DNS roots, except insofar as we 
recognize that certain disconnected systems may wish to have rather 
specially managed proxy caches of schema documents.  So, it would be 
reasonable for a relational database to maintain in its store a set of 
{URI,schema-document} pairs to be used as caches for representations of 
the named documents.  I don't >think< that implies a private DNS root.  I 
believe that such systems can be considered fully conforming insofar as 
the caches are legitimate proxies for the schema document web resources.
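
To make the cache idea concrete, here is a small sketch in Python.  It is 
an illustration under stated assumptions rather than anything the 
Recommendation specifies:  the SchemaDocumentCache class, its resolve 
method, the optional fetch callback, and the example URI are all 
hypothetical.  A disconnected system would simply be constructed without a 
fetch callback and would answer only from its stored pairs.

```python
class SchemaDocumentCache:
    """Hypothetical store of {URI, schema-document} pairs."""

    def __init__(self, entries=None, fetch=None):
        # entries: mapping of URI -> schema document bytes
        # fetch:   optional callable(uri) -> bytes for connected systems
        self._entries = dict(entries or {})
        self._fetch = fetch

    def resolve(self, uri: str) -> bytes:
        """Return the cached representation for uri, fetching if allowed."""
        if uri in self._entries:
            return self._entries[uri]
        if self._fetch is None:
            raise LookupError(f"no cached schema document for {uri}")
        document = self._fetch(uri)
        self._entries[uri] = document
        return document


# A disconnected system: every schema document it needs is pre-loaded.
PO_URI = "http://example.org/schemas/po.xsd"
PO_DOC = b'<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"/>'

offline_cache = SchemaDocumentCache({PO_URI: PO_DOC})
resolved = offline_cache.resolve(PO_URI)
```

Nothing in the sketch touches DNS:  the URI is used purely as a lookup 
key, which is the sense in which the cache acts as a proxy for the named 
resource.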
 
> Interesting stuff.
>  http://www.w3.org/TR/xmlschema-1/#concepts-conformance

Not "interesting" in the sense of the Confucian curse, I hope?

Noah

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------
Received on Sunday, 24 April 2005 12:48:08 UTC