Re: XML documents are character sequences, not byte sequences from Dan Connolly on 2002-12-17 (w3c-rdfcore-wg@w3.org from December 2002)

From: Dan Connolly <connolly@w3.org>
Date: 17 Dec 2002 08:42:36 -0600
To: Jeremy Carroll <jjc@hpl.hp.com>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <1040136156.11347.143.camel@dirk.dm93.org>

On Tue, 2002-12-17 at 07:25, Jeremy Carroll wrote:
> Dan
> 
> I would value your consideration on one motivation for the original text:
> 
> canonical XML documents *are* UTF-8 encoded.

Hmm... the c14n spec is no more clear than the XML spec on this point:

"The canonical form of an XML document is physical representation of the
document produced by the method described in this specification. The
changes are summarized in the following list:

      * The document is encoded in UTF-8"

 -- http://www.w3.org/TR/xml-c14n

I consider utf-8 to be a function from character sequences to byte
sequences
(cf http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital).

Now while characters aren't directly specified to be distinct
from bytes, we know that there are more characters than there
are bytes, so we can't identify character sequences with byte
sequences. So in general (i.e. for at least one value
of chars) utf-8(chars) <> chars.
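
To make that concrete, here is a small sketch in Python (my choice of
language for illustration; the variable names are made up). It treats
utf-8 as a function from a character sequence to a byte sequence and
shows one value of chars for which the two differ:

```python
# A character sequence: 'a' followed by the euro sign (U+20AC).
chars = "a\u20ac"

# utf-8 applied as a function: character sequence -> byte sequence.
data = chars.encode("utf-8")

# The euro sign's code point does not fit in a single byte, so this
# character sequence cannot simply be read as a byte sequence.
euro_fits_in_byte = ord(chars[1]) <= 0xFF

print(len(chars))         # number of characters
print(len(data))          # number of bytes
print(euro_fits_in_byte)
```

The character sequence has 2 elements and the byte sequence has 4, so
they can't be the same sequence.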

So either an XML document is a sequence of characters or
a sequence of bytes (or perhaps something else) but not
both.

One could say that XML documents are character sequences,
but the canonical form of an XML document is a byte
sequence. But the c14n spec seems to say that canonical
forms are a subset of XML documents.

One rational way out is to view an XML document as
a pair (charseq, encoding); together these determine
a byte sequence: encoding(charseq).

Then we could constrain canonical
XML documents so that the encoding is always utf-8.
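
A rough sketch of that pair view, again in Python. The names
XmlDocument, to_bytes and is_canonical_encoding are hypothetical ones
I'm using for illustration; nothing here comes from the XML or c14n
specs, and the check covers only the encoding constraint, not the
other c14n rules:

```python
import codecs
from typing import NamedTuple


class XmlDocument(NamedTuple):
    """Hypothetical model: a document is a (charseq, encoding) pair."""
    charseq: str
    encoding: str

    def to_bytes(self) -> bytes:
        # The pair determines a byte sequence.
        return self.charseq.encode(self.encoding)


def is_canonical_encoding(doc: XmlDocument) -> bool:
    # Only the encoding constraint: canonical documents use utf-8.
    # codecs.lookup normalizes aliases such as "UTF8" to "utf-8".
    return codecs.lookup(doc.encoding).name == "utf-8"


# Same character sequence, two different encodings.
utf8_doc = XmlDocument("<a>\u20ac</a>", "utf-8")
utf16_doc = XmlDocument("<a>\u20ac</a>", "utf-16-le")

print(utf8_doc.to_bytes())
print(utf16_doc.to_bytes())
print(is_canonical_encoding(utf8_doc), is_canonical_encoding(utf16_doc))
```

The two documents share a character sequence but determine different
byte sequences, and only the first satisfies the utf-8 constraint.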

> This was what originally motivated the constraint in some other text for
> which it was relevant and which got copy pasted into the current context
> where it is not so relevant.
> 
> I agree with your analysis that the UTF-8 encoding is probably wrong in the
> current text; however it does mean that the lexical space consists of
> strings, the mapping goes via strings considered as documents, and then the
> value space is documents considered as byte streams.
> 

Hmm... "considered as"... that's the sort of phrase that gets
us into trouble. The string "abc" is what it is, regardless
of what it's considered as. It would be coherent to say:
	the value space is byte streams that result from
	encoding strings.

But as I say above, I think the only rational approach is to look
at XML documents as pairs (charseq, encoding) which determine
a byte sequence.

> Hmmm,
> 
> (Not trying to change anything back - simply sharing some thoughts)
> 
> Jeremy
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/

Received on Tuesday, 17 December 2002 09:42:43 UTC