Re: Some issues with the IRI document [e9notutf8-05] from Paul Hoffman / IMC on 2003-04-16 (public-iri@w3.org from April 2003)

From: Paul Hoffman / IMC <phoffman@imc.org>
Date: Tue, 15 Apr 2003 19:48:05 -0700
To: public-iri@w3.org
Message-Id: <p0521063cbac274df5fe1@[142.131.246.132]>

>The text in that paragraph read
>
>    For example, for a document with a URI of
>    http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
>    construct a corresponding IRI (in XML notation, see Section 1.4):
>    http://www.example.org/r&#xe9;sum&#xe9;.html (&#xe9; stands for the
>    e-acute character, and is the UTF-8 encoded and escaped
>    representation of that character).  On the other hand, for a document
>    with an URI of http://www.example.org/r%E9sum%E9.html, the escaped
>    octets cannot be converted to actual characters in an IRI, because
>    the escaping is based on iso-8859-1 rather than UTF-8.
>
>The text in parentheses should have read:
>
>    (&#xe9; stands for the e-acute character, and %C3%A9 is the UTF-8
>    encoded and escaped representation of that character)
>
>I have fixed that in my internal copy. Do you think that this change
>helps you to understand the paragraph better?

Only a little. It still makes me think that you are talking about an encoding.

Look at the paragraph that precedes this one:
     In cases and for pieces where an encoding other than UTF-8 is used,
     and for raw binary data encoded in URIs (see [RFC2397]), the octets
     have to be %-escaped.  In these situations, the ability of IRIs to
     directly represent a wide character repertoire cannot be used.
How do you know the encoding of the URI? How can you tell if it is 
UTF-8 (and therefore convertible to an IRI) or something else?

Asked another way, if I'm writing an IRI converter, how do I know 
that this is OK:
    http://www.example.org/r%C3%A9sum%C3%A9.html
But this isn't:
    http://www.example.org/r%E9sum%E9.html
Is it simply because the second one fails a UTF-8 decode test? What 
about characters from other encodings whose octet values are the 
same as valid UTF-8 sequences?
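
To make the question concrete, here is a rough sketch in Python of the kind of test I have in mind. It is only an illustration, not anything the draft specifies: the function name percent_octets_are_utf8 is made up, and I am assuming a converter would unescape the octets and try a strict UTF-8 decode.

```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(uri: str) -> bool:
    """Hypothetical check: do the unescaped octets of `uri` form valid UTF-8?"""
    raw = unquote_to_bytes(uri)  # turn each %XX into the octet it stands for
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_uri = "http://www.example.org/r%C3%A9sum%C3%A9.html"
latin1_uri = "http://www.example.org/r%E9sum%E9.html"

print(percent_octets_are_utf8(utf8_uri))    # True
print(percent_octets_are_utf8(latin1_uri))  # False: 0xE9 is not followed by continuation octets

# The ambiguity: the two iso-8859-1 characters U+00C3 U+00A9 produce the
# same octets (C3 A9) as the single UTF-8 encoded character U+00E9.
latin1_octets = "\u00c3\u00a9".encode("iso-8859-1")
utf8_octets = "\u00e9".encode("utf-8")
print(latin1_octets == utf8_octets)         # True
```

So the decode test rejects the second URI, but passing it does not prove that the escaping was based on UTF-8 in the first place.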

--Paul Hoffman, Director
--Internet Mail Consortium

Received on Tuesday, 15 April 2003 22:48:20 UTC