Re: XML documents are character sequences, not byte sequences from pat hayes on 2002-12-17 (w3c-rdfcore-wg@w3.org from December 2002)

From: pat hayes <phayes@ai.uwf.edu>
Date: Tue, 17 Dec 2002 10:30:03 -0600
To: Dan Connolly <connolly@w3.org>, Jeremy Carroll <jjc@hpl.hp.com>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p05111b0aba24fff03927@[10.0.100.86]>

Guys, it occurs to me that, if it would help, it would be easy to
tweak the decision about literals denoting themselves. That decision
doesn't have to be read as strictly as it turns out to be once we get
this fine-grained. For example, if a literal has to be a byte
sequence, then what it denotes could be something more abstract, such
as a character sequence. Let me know if you want any tweaking done.

Pat
--------

>On Tue, 2002-12-17 at 07:25, Jeremy Carroll wrote:
>>  Dan
>>
>>  I would value your consideration on one motivation for the original text:
>>
>>  canonical XML documents *are* UTF-8 encoded.
>
>Hmm... the c14n spec is no more clear than the XML spec on this point:
>
>"The canonical form of an XML document is physical representation of the
>document produced by the method described in this specification. The
>changes are summarized in the following list:
>
>       * The document is encoded in UTF-8"
>
>  -- http://www.w3.org/TR/xml-c14n
>
>I consider utf-8 to be a function from character sequences to byte
>sequences
>(cf. http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Digital).
>
>Now while characters aren't directly specified to be distinct
>from bytes, we know that there are more characters than there
>are bytes, so we can't identify character sequences with byte
>sequences. So in general (i.e. for at least one value
>of chars) utf-8(chars) <> chars.
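
The point that utf-8(chars) <> chars for at least one value of chars can be checked with a small Python sketch. It assumes Python's str stands in for "character sequence" and bytes for "byte sequence"; the function name utf8 is made up for the illustration and does not come from any spec.

```python
# Illustration only: str models a character sequence, bytes a byte sequence.

def utf8(chars: str) -> bytes:
    """UTF-8 viewed as a function from character sequences to byte sequences."""
    return chars.encode("utf-8")

chars = "caf\u00e9"    # 4 characters; the last one is U+00E9
octets = utf8(chars)   # 5 bytes; the last character needs two of them

# A byte has only 256 possible values and there are far more characters,
# so the two kinds of sequence cannot be identified in general.
assert len(chars) == 4
assert len(octets) == 5
assert octets == b"caf\xc3\xa9"
assert list(octets) != [ord(c) for c in chars]
```

For pure ASCII input the byte values and code points happen to coincide, which is probably why the two are so easily conflated.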
>
>So either an XML document is a sequence of characters or
>a sequence of bytes (or perhaps something else) but not
>both.
>
>One could say that XML documents are character sequences,
>but the canonical form of an XML document is a byte
>sequence. But the c14n spec seems to say that canonical
>forms are a subset of XML documents.
>
>One rational way out is to view an XML document as
>a pair (charseq, encoding); together these determine
>a byte sequence: encoding(charseq).
>
>Then we could constrain canonical
>XML documents so that the encoding is always utf-8.
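
A rough Python sketch of the (charseq, encoding) view, under stated assumptions: XmlDocument, to_bytes and is_canonical_encoding are hypothetical names invented here, not part of any XML library, and the canonical check looks only at the encoding constraint, not at the other c14n rules.

```python
import codecs
from typing import NamedTuple

class XmlDocument(NamedTuple):
    """Hypothetical model: a document is a pair (charseq, encoding)."""
    charseq: str
    encoding: str  # a codec name known to Python, e.g. "utf-8"

    def to_bytes(self) -> bytes:
        # The pair determines a byte sequence: encoding(charseq).
        return self.charseq.encode(self.encoding)

    def is_canonical_encoding(self) -> bool:
        # Checks only the "encoding is always utf-8" constraint.
        return codecs.lookup(self.encoding).name == "utf-8"

# Same character sequence, two encodings, two different byte sequences.
a = XmlDocument("<e>\u00e9</e>", "utf-8")
b = XmlDocument("<e>\u00e9</e>", "iso-8859-1")
assert a.charseq == b.charseq
assert a.to_bytes() != b.to_bytes()
assert a.is_canonical_encoding() and not b.is_canonical_encoding()
```

On this model the two documents are distinct pairs that share a character sequence, so nothing has to be both a character sequence and a byte sequence at once.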
>
>
>>  This was what originally motivated the constraint in some other text for
>>  which it was relevant and which got copy pasted into the current context
>>  where it is not so relevant.
>>
>>  I agree with your analysis that the UTF-8 encoding is probably wrong in the
>>  current text; however it does mean that the lexical space consists of
>>  strings, the mapping goes via strings considered as documents, and then the
>>  value space is documents considered as byte streams.
>>
>
>Hmm... "considered as"... that's the sort of phrase that gets
>us into trouble. The string "abc" is what it is, regardless
>of what it's considered as. It would be coherent to say
>	the value space is byte streams that result from
>	encoding strings
>
>But as I say above, I think the only rational approach is to look
>at XML documents as pairs (charseq, encoding) which determine
>a byte sequence.
>
>
>>  Hmmm,
>>
>>  (Not trying to change anything back - simply sharing some thoughts)
>>
>>  Jeremy
>--
>Dan Connolly, W3C http://www.w3.org/People/Connolly/


-- 
---------------------------------------------------------------------
IHMC					(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola              			(850)202 4440   fax
FL 32501           				(850)291 0667    cell
phayes@ai.uwf.edu	          http://www.coginst.uwf.edu/~phayes
s.pam@ai.uwf.edu   for spam

Received on Tuesday, 17 December 2002 11:30:10 UTC