- From: Dan Connolly <connolly@w3.org>
- Date: 17 Dec 2002 08:42:36 -0600
- To: Jeremy Carroll <jjc@hpl.hp.com>
- Cc: w3c-rdfcore-wg@w3.org
On Tue, 2002-12-17 at 07:25, Jeremy Carroll wrote:
> Dan
>
> I would value your consideration on one motivation for the original text:
>
> canonical XML documents *are* UTF-8 encoded.

Hmm... the c14n spec is no more clear than the XML spec on this point:

  "The canonical form of an XML document is physical representation
  of the document produced by the method described in this
  specification. The changes are summarized in the following list:

    * The document is encoded in UTF-8"

  -- http://www.w3.org/TR/xml-c14n

I consider utf-8 to be a function from character sequences to byte
sequences.
(cf http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital)

Now while characters aren't directly specified to be distinct from
bytes, we know that there are more characters than there are bytes,
so we can't identify character sequences with byte sequences.
So in general (i.e. for at least one value of chars)
utf-8(chars) <> chars.

So either an XML document is a sequence of characters or a sequence
of bytes (or perhaps something else) but not both.

One could say that XML documents are character sequences, but the
canonical form of an XML document is a byte sequence. But the c14n
spec seems to say that canonical forms are a subset of XML documents.

One rational way out is to view an XML document as a pair
(charseq, encoding); together these determine a byte sequence:
encoding(charseq). Then we could constrain canonical XML documents
so that the encoding is always utf-8.

> This was what originally motivated the constraint in some other text for
> which it was relevant and which got copy pasted into the current context
> where it is not so relevant.
>
> I agree with your analysis that the UTF-8 encoding is probably wrong in the
> current text; however it does mean that the lexical space consists of
> strings, the mapping goes via strings considered as documents, and then the
> value space is documents considered as byte streams.

Hmm... "considered as"... that's the sort of phrase that gets us
into trouble. The string "abc" is what it is, regardless of what
it's considered as.

It would be coherent to say the value space is byte streams that
result from encoding strings.

But as I say above, I think the only rational approach is to look
at XML documents as pairs (charseq, encoding) which determine
a byte sequence.

> Hmmm,
>
> (Not trying to change anything back - simply sharing some thoughts)
>
> Jeremy

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
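
To make the (charseq, encoding) view above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names utf8, XmlDocument, to_bytes and is_canonical_encoding are hypothetical and come from neither the XML spec nor the c14n spec, and the sketch models only the encoding step, not the rest of canonicalization.

```python
import codecs
from dataclasses import dataclass


def utf8(chars: str) -> bytes:
    """UTF-8 viewed as a function from character sequences to byte sequences."""
    return chars.encode("utf-8")


@dataclass(frozen=True)
class XmlDocument:
    """Hypothetical model: a document is a pair (charseq, encoding)."""
    charseq: str   # the sequence of characters
    encoding: str  # name of the character encoding, e.g. "utf-8"

    def to_bytes(self) -> bytes:
        # The pair determines a byte sequence: encoding(charseq).
        return self.charseq.encode(self.encoding)

    def is_canonical_encoding(self) -> bool:
        # Assumed constraint from the message: canonical documents
        # always use utf-8. codecs.lookup normalizes aliases like "UTF8".
        return codecs.lookup(self.encoding).name == "utf-8"


chars = "<a>\u00e9</a>"  # seven characters, one of them non-ASCII

doc_utf8 = XmlDocument(chars, "utf-8")
doc_latin1 = XmlDocument(chars, "iso-8859-1")

# Same character sequence, different byte sequences.
print(len(chars), len(doc_utf8.to_bytes()), len(doc_latin1.to_bytes()))
print(doc_utf8.is_canonical_encoding(), doc_latin1.is_canonical_encoding())
```

The two documents share one character sequence yet determine different byte sequences, which is why the character sequence alone cannot be identified with the bytes.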
Received on Tuesday, 17 December 2002 09:42:43 UTC