Re: I18N (was: Closing rdfms-difference-between-ID-and-about) from Jeremy Carroll on 2001-10-18 (w3c-rdfcore-wg@w3.org from October 2001)

From: Jeremy Carroll <jjc@hplb.hpl.hp.com>
Date: Thu, 18 Oct 2001 09:54:58 +0100
To: w3c-rdfcore-wg@w3.org
Message-ID: <3BCE98E2.EB70DEE0@hplb.hpl.hp.com>

Jeremy wrote:
> >Internally XML documents are in Unicode,  [ ... snip ... ]
gk:
> I'm not sure what it means to say "Internally XML documents are in Unicode"
> .. I though the XML was essentially a serialization syntax (for a labelled
> and annotated tree structure).

Yes sorry I was less than clear.

The bit of the XML spec I was thinking of is section 2.2.
[[[
A parsed entity contains text, a sequence of characters, [ ... ]
A character is an atomic unit of text as specified by ISO/IEC 10646
]]]

My understanding of XML's relationship to charset's is that there is an
early conversion into unicode.

gk:
> 
> Yes, I agree this should go in the syntax document.
> But I note that RFC 2396 recognizes three different presentations of a URI:
>    original character sequence
>    octet sequence
>    URI character sequence
>

Hmmm, you got me reaching for my copy of RFC2396 ...

[[[
     There are two mappings, one from URI characters to octets, and
   a second from octets to original characters:

   URI character sequence->octet sequence->original character sequence

]]]

My understanding is that URI is generally understood as a URI character
sequence in US ASCII.
I also note that the left-to-right arrows in the quoted text are wishful
thinking (because the charset is undefined), while from right-to-left
there is a well-defined idempotent algorithm (using either the known
charset or UTF-8).

gk:
> I think we'd need to be clear which of these is intended, if that's
> important.  For the model theory, I'm not sure that it is important (as
> long as its not a non-UTF-8 octet sequence).

I think your parenthetical comment forces the graph and the model theory
into using the "URI character sequence in US ASCII", as the fixed point.
Hence my worry about the case of the hexadecimal escape characters.

For the RDF/XML syntax the idempotency of the URI escaping algorithm
means precisely that we don't need to be clear about which of these
three is intended. This is good news because (document) backward
compatibility is immediate. A document with URIs all written in the
escaped notation in US-ASCII can be correctly read by a processor that
is (re)running the escaping algorithm.

Jeremy

Received on Thursday, 18 October 2001 04:50:14 UTC