Re: ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality from noah_mendelsohn@us.ibm.com on 2004-06-29 (www-tag@w3.org from June 2004)

From: <noah_mendelsohn@us.ibm.com>
Date: Tue, 29 Jun 2004 15:12:33 -0400
To: Norman Walsh <Norman.Walsh@Sun.COM>
Cc: www-tag@w3.org
Message-ID: <OF36417565.A43C569F-ON85256EC2.0069CE08@lotus.com>

This all makes sense and is more or less what I thought or was hoping
you'd say. I do think it might be worth a bit of explicit discussion of
the versioning issue, just so your readers will be clear that these choices
are indeed intentional.

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Norman Walsh <Norman.Walsh@Sun.COM>
Sent by: www-tag-request@w3.org
06/29/04 03:01 PM

To: www-tag@w3.org
cc: (bcc: Noah Mendelsohn/Cambridge/IBM)
Subject: Re: ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality

/ noah_mendelsohn@us.ibm.com was heard to say:
| Norm: I'm curious about a few things. Have you given any thought to the
| XML 1.0/XML 1.1 issue as it relates to your writeup?

Yes. In my definition of infoset equality, the XML version is
irrelevant.

In other words, I *do* think that these two documents are the same:

<?xml version='1.0'?>
<doc/>

<?xml version='1.1'?>
<doc/>

I can imagine applications that care if they're getting 1.0 or 1.1
(though I have to work pretty hard at it and the examples that I can
think of are generally pretty contrived). I can't imagine any useful
purpose in saying the two documents above are not the same.
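To make the claim concrete, here is a minimal sketch of an equality
function that ignores the version. The Document and Element classes and
the infoset_equal function are hypothetical stand-ins for information
items, not part of any spec or library.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    # Hypothetical stand-in for an element information item.
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element or str items


@dataclass
class Document:
    # Hypothetical stand-in for a document information item.
    version: str
    root: Element


def infoset_equal(a: Document, b: Document) -> bool:
    # The [version] property is deliberately left out of the comparison.
    return a.root == b.root


doc_10 = Document(version="1.0", root=Element("doc"))
doc_11 = Document(version="1.1", root=Element("doc"))
other = Document(version="1.0", root=Element("other"))
```

Under these assumptions, infoset_equal(doc_10, doc_11) is true even
though the two Document objects differ in their version field.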

| As best I can tell,
| Infoset has no formal notion of the distinction, but implies that if a
| (non-synthetic) Infoset resulting from the parse of an actual serialized
| document results in a [version] property on the document information item,
| then that version applies in some sense to all descendants. There is no
| conformance rule relating to the possibility that, for example, element
| names would in fact be consistent with the apparent version. I can also
| see no indication that versions must be applied consistently in the case
| that a synthetic infoset is constructed.

Right, I think it's up to specs that use the infoset to explain the
conformance requirements.

| I'm not particularly advocating one answer or another, but I think it
| would be useful to verify that
|
| a) Two document info items that differ only in [version] are or are not
| equal?

I think they are equal. I should have made that explicit.

| b) Is it even meaningful to compare two elements taken out of the context
| of their enclosing documents? I believe it's fair to say that the Infoset
| rec is silent as to whether such things can exist or be dealt with in
| isolation; I note that the schema recommendation currently claims to
| validate element info items and doesn't mention document info items one
| way or the other. We are in fact debating whether it is sensible to imply
| that one can invariably walk up to some document info item ancestor to
| determine an XML version. Anyway, in whatever way it is meaningful to
| compare element info items "out of context", we need to decide whether XML
| versions enter into the equation, and how that relates to our story at the
| document info item level.

What's different in 1.1?

1. There are more Unicode characters in XML Names.
2. You can't have C1 control characters (other than NEL) if they're unescaped.
3. You can have C0 control characters (except NUL) if they're escaped.
4. The NEL character is normalized to LF in text content.
5. There's a more explicit nod towards Unicode normalization.

How does this impact our definition of "the same"?

1. Either two names use the exact same characters or they don't. No
1.0 name will ever "accidentally" be the same as a 1.1 name.

2-4. Are all serialization issues. My proposed equality function
doesn't detect the difference between "&#65;" and "A" so I don't see
why it should care about how the C0 or C1 controls are presented in
the serialized form (if it ever existed).

One possible difference is that a NEL from an IBM mainframe might
have been normalized to LF. But the whole point of that change is
to allow NEL to be treated like LF. If two documents are the same
after that transformation, then they were always meant to be the
same.

If I have NEL characters that aren't newlines in my non-mainframe
application, they'll still be NEL in the infoset unless I move to
1.1, and I'd better solve this problem before I move to 1.1, so I
think it's still reasonable for the infoset equality function to
ignore this as a serialization issue.

5. If one document is normalized and the other isn't, they won't be
equal. But I bet the I18N folks think you SHOULD normalize XML 1.0
documents too.
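
The serialization points in 2-4 can be sketched roughly as follows. The
parser calls are from Python's standard library; the helper
normalize_line_ends_xml11 is a hypothetical name, written here to mimic
the line-end handling of XML 1.1 (section 2.11), and is not taken from
any parser.

```python
import re
import xml.etree.ElementTree as ET

# A character reference and the literal character parse to the same text.
escaped = ET.fromstring("<doc>&#65;</doc>").text
literal = ET.fromstring("<doc>A</doc>").text

# Line-end sequences that XML 1.1 translates to a single LF.
# Two-character sequences come first so that they match as a unit.
_XML11_LINE_ENDS = re.compile("\r\n|\r\x85|\r|\x85|\u2028")


def normalize_line_ends_xml11(text: str) -> str:
    # Hypothetical helper: approximates what a 1.1 parser does to
    # line ends before the infoset is built.
    return _XML11_LINE_ENDS.sub("\n", text)
```

Both escaped and literal hold the same one-character string, so an
equality function working on the parsed result cannot tell them apart.
Likewise, a NEL run through normalize_line_ends_xml11 is
indistinguishable from an LF afterwards.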

In other words, all of the changes between 1.0 and 1.1 are exposed in
other properties. No 1.1 document that exercises any of these
differences will ever be the same as a 1.0 document that does not.
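
As an illustration of a difference showing up in another property,
here is a small sketch of the normalization case from item 5. It uses
only the standard unicodedata module; the sample strings are made up
for the example.

```python
import unicodedata

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

# Compared code point by code point, the two strings differ.
same_as_is = composed == decomposed

# After normalizing both to NFC, they compare equal.
same_after_nfc = (unicodedata.normalize("NFC", composed)
                  == unicodedata.normalize("NFC", decomposed))
```

Here same_as_is is false and same_after_nfc is true, so the difference
is visible in the character content itself, whatever the version says.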

Be seeing you,
norm

--
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.

Received on Tuesday, 29 June 2004 15:16:29 UTC