- From: <noah_mendelsohn@us.ibm.com>
- Date: Tue, 29 Jun 2004 15:12:33 -0400
- To: Norman Walsh <Norman.Walsh@Sun.COM>
- Cc: www-tag@w3.org
- Message-ID: <OF36417565.A43C569F-ON85256EC2.0069CE08@lotus.com>
This all makes sense and is more or less what I thought or was hoping
you'd say. I do think it might be worth a bit of explicit discussion of
the versioning issue, just so your readers will be clear that these
choices are indeed intentional.

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Norman Walsh <Norman.Walsh@Sun.COM>
Sent by: www-tag-request@w3.org
06/29/04 03:01 PM

To: www-tag@w3.org
cc: (bcc: Noah Mendelsohn/Cambridge/IBM)
Subject: Re: ACTION NW xmlChunk-44: Chunk of XML - Canonicalization and equality

/ noah_mendelsohn@us.ibm.com was heard to say:
| Norm: I'm curious about a few things. Have you given any thought to the
| XML 1.0/XML 1.1 issue as it relates to your writeup.

Yes. In my definition of infoset equality, the XML version is
irrelevant. In other words, I *do* think that these two documents are
the same:

  <?xml version='1.0'?>
  <doc/>

  <?xml version='1.1'?>
  <doc/>

I can imagine applications that care if they're getting 1.0 or 1.1
(though I have to work pretty hard at it and the examples that I can
think of are generally pretty contrived). I can't imagine any useful
purpose in saying the two documents above are not the same.

| As best I can tell,
| Infoset has no formal notion of the distinction, but implies that if a
| (non-synthetic) Infoset resulting from the parse of an actual serialized
| document results in a [version] property on the document information item,
| then that version applies in some sense to all descendants. There is no
| conformance rule relating to the possibility that, for example, element
| names would in fact be consistent with the apparent version. I can also
| see no indication that versions must be applied consistently in the case
| that a synthetic infoset is constructed.

Right, I think it's up to specs that use the infoset to explain the
conformance requirements.
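
As an illustration of an equality test that disregards [version], here is a minimal Python sketch over a synthetic, heavily simplified infoset. The `Element` and `Document` classes and their fields are hypothetical names invented for this example. They are not part of the Infoset recommendation and are not the equality function proposed in this thread; they only show the [version] property being recorded while staying out of the comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Element:
    """Hypothetical, simplified stand-in for an element information item."""
    name: str
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Element or str items


@dataclass
class Document:
    """Hypothetical, simplified stand-in for a document information item."""
    root: Element
    # [version] is kept on the item but excluded from the generated __eq__,
    # so two documents that differ only in version compare equal.
    version: Optional[str] = field(default=None, compare=False)


doc_10 = Document(Element("doc"), version="1.0")
doc_11 = Document(Element("doc"), version="1.1")
other = Document(Element("other"), version="1.0")

print(doc_10 == doc_11)  # differ only in [version]
print(doc_10 == other)   # differ in the root element name
```

Marking the field with `compare=False` keeps the version available to applications that do care about it, while the comparison itself looks only at the remaining properties.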

| I'm not particularly advocating one answer or another, but I think it
| would be useful to verify that
|
| a) Two document info items that differ only in [version] are or are not
| equal?

I think they are equal. I should have made that explicit.

| b) Is it even meaningful to compare two elements taken out of the context
| of their enclosing documents. I believe it's fair to say that the Infoset
| rec is silent as to whether such things can exist or be dealt with in
| isolation; I note that the schema recommendation currently claims to
| validate element info items and doesn't mention document info items one
| way or the other. We are in fact debating whether it is sensible to imply
| that one can invariably walk up to some document info item ancestor to
| determine an XML version. Anyway, in whatever way it is meaningful to
| compare element info items "out of context", we need to decide whether XML
| versions enter into the equation, and how that relates to our story at the
| document info item level.

What's different in 1.1?

1. There are more Unicode characters in XML Names.
2. You can't have C1 control characters if they're unescaped.
3. You can have C0 control characters (except NUL) if they're escaped.
4. The NEL character is normalized to LF in text content.
5. There's a more explicit nod towards Unicode normalization.

How does this impact our definition of "the same"?

1. Either two names use the exact same characters or they don't. No
1.0 name will ever "accidentally" be the same as a 1.1 name.

2-4. Are all serialization issues. My proposed equality function
doesn't detect the difference between "&#65;" and "A" so I don't see
why it should care about how the C0 or C1 controls are presented in
the serialized form (if it ever existed).

One possible difference is that a NEL from an IBM mainframe might have
been normalized to LF. But the whole point of that change is to allow
NEL to be treated like LF.
If two documents are the same after that transformation, then they were
always meant to be the same. If I have NEL characters that aren't
newlines in my non-mainframe application, they'll still be NEL in the
infoset unless I move to 1.1, and I better solve this problem before I
move to 1.1, so I think it's still reasonable for the infoset equality
function to ignore this as a serialization issue.

5. If one document is normalized and the other isn't, they won't be
equal. But I bet the I18N folks think you SHOULD normalize XML 1.0
documents too.

In other words, all of the changes between 1.0 and 1.1 are exposed in
other properties. No 1.1 document that exercises any of these
differences will ever be the same as a 1.0 document that does not.

Be seeing you,
norm

--
Norman.Walsh@Sun.COM / XML Standards Architect / Sun Microsystems, Inc.
NOTICE: This email message is for the sole use of the intended
recipient(s) and may contain confidential and privileged information.
Any unauthorized review, use, disclosure or distribution is prohibited.
If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.

Attachments
- application/octet-stream attachment: att89cpy.dat
Received on Tuesday, 29 June 2004 15:16:29 UTC