
Re: ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality

From: <noah_mendelsohn@us.ibm.com>
Date: Tue, 29 Jun 2004 15:12:33 -0400
To: Norman Walsh <Norman.Walsh@Sun.COM>
Cc: www-tag@w3.org
Message-ID: <OF36417565.A43C569F-ON85256EC2.0069CE08@lotus.com>
This all makes sense and is more or less what I thought or was hoping 
you'd say.  I do think it might be worth a bit of explicit discussion of 
the versioning issue, just so your readers will be clear that these choices 
are indeed intentional.

--------------------------------------
Noah Mendelsohn 
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------








Norman Walsh <Norman.Walsh@Sun.COM>
Sent by: www-tag-request@w3.org
06/29/04 03:01 PM

 
        To:     www-tag@w3.org
        cc:     (bcc: Noah Mendelsohn/Cambridge/IBM)
        Subject:        Re: ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality


/ noah_mendelsohn@us.ibm.com was heard to say:
| Norm:  I'm curious about a few things.  Have you given any thought to the
| XML 1.0/XML 1.1 issue as it relates to your writeup.

Yes. In my definition of infoset equality, the XML version is
irrelevant.

In other words, I *do* think that these two documents are the same:

<?xml version='1.0'?>
<doc/>

<?xml version='1.1'?>
<doc/>

I can imagine applications that care if they're getting 1.0 or 1.1
(though I have to work pretty hard at it and the examples that I can
think of are generally pretty contrived). I can't imagine any useful
purpose in saying the two documents above are not the same.
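
A minimal sketch of that position, using a toy model of the infoset
rather than a real parser. The Element and Document classes and the
infoset_equal function are hypothetical names made up for this
illustration; the only point is that [version] is carried on the
document item but never consulted by the comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Element:
    """Toy element information item (hypothetical model)."""
    name: str
    attributes: tuple = ()   # sorted (name, value) pairs
    children: tuple = ()     # nested Element items or text strings


@dataclass(frozen=True)
class Document:
    """Toy document information item with a [version] property."""
    root: Element
    version: Optional[str] = None


def infoset_equal(a: Document, b: Document) -> bool:
    # [version] is deliberately left out of the comparison.
    return a.root == b.root


doc_10 = Document(Element("doc"), version="1.0")
doc_11 = Document(Element("doc"), version="1.1")
same = infoset_equal(doc_10, doc_11)
```

Here same is True even though the two Document values differ in
their version field.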

| As best I can tell,
| Infoset has no formal notion of the distinction, but implies that if a
| (non-synthetic) Infoset resulting from the parse of an actual serialized
| document results in a [version] property on the document information item,
| then that version applies in some sense to all descendants.  There is no
| conformance rule relating to the possibility that, for example, element
| names would in fact be consistent with the apparent version.  I can also
| see no indication that versions must be applied consistently in the case
| that a synthetic infoset is constructed.

Right, I think it's up to specs that use the infoset to explain the
conformance requirements.

| I'm not particularly advocating one answer or another, but I think it 
| would be useful to verify that 
|
| a) Two document info items that differ only in [version] are or are not 
| equal?

I think they are equal. I should have made that explicit.

| b) Is it even meaningful to compare two elements taken out of the context
| of their enclosing documents.  I believe it's fair to say that the Infoset
| rec is silent as to whether such things can exist or be dealt with in
| isolation;  I note that the schema recommendation currently claims to
| validate element info items and doesn't mention document info items one
| way or the other.  We are in fact debating whether it is sensible to imply
| that one can invariably walk up to some document info item ancestor to
| determine an XML version.  Anyway, in whatever way it is meaningful to
| compare element info items "out of context", we need to decide whether XML
| versions enter into the equation, and how that relates to our story at the
| document info item level.

What's different in 1.1?

  1. There are more Unicode characters in XML Names.
  2. You can't have C1 control characters if they're unescaped.
  3. You can have C0 control characters (except NUL) if they're escaped.
  4. The NEL character is normalized to LF in text content.
  5. There's a more explicit nod towards Unicode normalization
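
To make items 2 and 3 concrete, here is a rough sketch of what a
1.1 serializer has to do with control characters. The character
class is the RestrictedChar production from the XML 1.1
Recommendation (it also covers DEL, U+007F, and leaves out NEL,
U+0085); the helper name escape_restricted is an assumption for
illustration, and it handles only this one concern, not "&" or "<".

```python
import re

# RestrictedChar from XML 1.1: C0 controls other than TAB, LF and CR,
# plus U+007F-U+0084 and U+0086-U+009F. NUL is not allowed at all.
RESTRICTED_11 = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")


def escape_restricted(text: str) -> str:
    """Replace each restricted character with a numeric character reference."""
    return RESTRICTED_11.sub(lambda m: f"&#x{ord(m.group()):X};", text)


escaped_c0 = escape_restricted("a\x01b")   # C0 control: must be escaped
escaped_c1 = escape_restricted("\x80")     # C1 control: must be escaped
untouched = escape_restricted("tab\tnel\x85")  # TAB and NEL pass through
```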

How does this impact our definition of "the same"?

1. Either two names use the exact same characters or they don't. No
   1.0 name will ever "accidentally" be the same as a 1.1 name.

2-4. Are all serialization issues. My proposed equality function
   doesn't detect the difference between "&#65;" and "A" so I don't see
   why it should care about how the C0 or C1 controls are presented in
   the serialized form (if it ever existed).

   One possible difference is that a NEL from an IBM mainframe might
   have been normalized to LF. But the whole point of that change is
   to allow NEL to be treated like LF. If two documents are the same
   after that transformation, then they were always meant to be the
   same.

   If I have NEL characters that aren't newlines in my non-mainframe
   application, they'll still be NEL in the infoset unless I move to
   1.1, and I'd better solve this problem before I move to 1.1, so I
   think it's still reasonable for the infoset equality function to
   ignore this as a serialization issue.

5. If one document is normalized and the other isn't, they won't be
   equal. But I bet the I18N folks think you SHOULD normalize XML 1.0
   documents too.
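
The three cases above can be sketched in a few lines. This is an
illustration under stated assumptions, not the proposed equality
function itself: normalize_line_ends_11 and text_equal are
hypothetical helpers, the first following the end-of-line rules in
XML 1.1 section 2.11, and the parsing uses Python's standard
ElementTree on XML 1.0 input only.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Items 2-4: a character reference and the literal character yield the
# same text once parsed, so the comparison never sees the difference.
ref_text = ET.fromstring("<doc>&#65;</doc>").text
lit_text = ET.fromstring("<doc>A</doc>").text

# XML 1.1 end-of-line handling: CR LF, CR NEL, NEL, LINE SEPARATOR and
# any remaining lone CR are all translated to a single LF.
_LINE_END_11 = re.compile("\r\n|\r\u0085|\u0085|\u2028|\r")


def normalize_line_ends_11(text: str) -> str:
    return _LINE_END_11.sub("\n", text)


# Item 5: the same letter in composed and decomposed form.
composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT


def text_equal(a: str, b: str, normalize: bool = False) -> bool:
    """Code point comparison, optionally after NFC normalization."""
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b
```

With no normalization step, text_equal(composed, decomposed) is
False, which is the behaviour described in item 5; passing
normalize=True makes the two compare equal.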

In other words, all of the changes between 1.0 and 1.1 are exposed in
other properties. No 1.1 document that exercises any of these
differences will ever be the same as a 1.0 document that does not.

                                        Be seeing you,
                                          norm

-- 
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.




Received on Tuesday, 29 June 2004 15:16:29 GMT
