- From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- Date: Thu, 18 Oct 2001 09:54:58 +0100
- To: w3c-rdfcore-wg@w3.org
Jeremy wrote:
> >Internally XML documents are in Unicode,
[ ... snip ... ]

gk:
> I'm not sure what it means to say "Internally XML documents are in Unicode"
> .. I thought the XML was essentially a serialization syntax (for a labelled
> and annotated tree structure).

Yes, sorry, I was less than clear. The bit of the XML spec I was thinking of is section 2.2.

[[[
A parsed entity contains text, a sequence of characters, [ ... ]
A character is an atomic unit of text as specified by ISO/IEC 10646
]]]

My understanding of XML's relationship to charsets is that there is an early conversion into Unicode.

gk:
> > Yes, I agree this should go in the syntax document.
> But I note that RFC 2396 recognizes three different presentations of a URI:
>    original character sequence
>    octet sequence
>    URI character sequence
>

Hmmm, you got me reaching for my copy of RFC2396 ...

[[[
There are two mappings, one from URI characters to octets, and a second from octets to original characters:

URI character sequence->octet sequence->original character sequence
]]]

My understanding is that a URI is generally understood as a URI character sequence in US-ASCII. I also note that the left-to-right arrows in the quoted text are wishful thinking (because the charset is undefined), while from right to left there is a well-defined idempotent algorithm (using either the known charset or UTF-8).

gk:
> I think we'd need to be clear which of these is intended, if that's
> important. For the model theory, I'm not sure that it is important (as
> long as it's not a non-UTF-8 octet sequence).

I think your parenthetical comment forces the graph and the model theory into using the "URI character sequence in US-ASCII" as the fixed point. Hence my worry about the case (upper or lower) of the hexadecimal escape characters.

For the RDF/XML syntax, the idempotency of the URI escaping algorithm means precisely that we don't need to be clear about which of these three is intended.

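To make the idempotency point concrete, here is a rough sketch of the kind of escaping I have in mind. It is a simplification and not the text of any spec: the function name escape_uri is made up for illustration, the charset is assumed to be UTF-8, and only non-ASCII characters are escaped (ASCII characters, including any existing %HH escapes, pass through untouched).

```python
def escape_uri(text: str) -> str:
    """Escape non-ASCII characters as %HH over their UTF-8 octets.

    Hypothetical helper for illustration only. ASCII characters,
    including '%' and any escapes already present, are left as-is.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Already a US-ASCII character: keep unchanged.
            out.append(ch)
        else:
            # Original character -> UTF-8 octets -> %HH escapes.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


original = "http://example.org/caf\u00e9"
once = escape_uri(original)
twice = escape_uri(once)

# The output is pure ASCII, so running the algorithm again changes nothing.
assert once == "http://example.org/caf%C3%A9"
assert twice == once

# The sketch does not normalise the case of existing escapes, so these
# two spellings stay distinct character sequences.
assert escape_uri("http://example.org/caf%c3%a9") != once
```

The last assertion is the case worry in miniature: re-running the escaping leaves %c3%a9 and %C3%A9 as different strings, so something else has to decide whether they name the same thing.
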
This idempotency is good news because (document) backward compatibility is immediate: a document whose URIs are all written in the escaped notation in US-ASCII can be correctly read by a processor that (re)runs the escaping algorithm.

Jeremy
Received on Thursday, 18 October 2001 04:50:14 UTC