Re: Some issues with the IRI document [e9notutf8-05] from Martin Duerst on 2003-04-16 (public-iri@w3.org from April 2003)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 16 Apr 2003 14:18:29 -0400
To: Paul Hoffman / IMC <phoffman@imc.org>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20030416135237.05490650@localhost>
At 19:48 03/04/15 -0700, Paul Hoffman / IMC wrote:

>>The text in that paragraph read
>>
>>    For example, for a document with a URI of
>>    http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
>>    construct a corresponding IRI (in XML notation, see Section 1.4):
>>    http://www.example.org/r&#xe9;sum&#xe9;.html (&#xe9; stands for the
>>    e-acute character, and is the UTF-8 encoded and escaped
>>    representation of that character).  On the other hand, for a document
>>    with an URI of http://www.example.org/r%E9sum%E9.html, the escaped
>>    octets cannot be converted to actual characters in an IRI, because
>>    the escaping is based on iso-8859-1 rather than UTF-8.
>>
>>The text in parentheses should have read:
>>
>>    (&#xe9; stands for the e-acute character, and %C3%A9 is the UTF-8
>>    encoded and escaped representation of that character)
>>
>>I have fixed that in my internal copy. Do you think that this change
>>helps you to understand the paragraph better?
>
>Only a little. It still makes me think that you are talking about an encoding.

Well, of course I'm talking about an encoding here, namely UTF-8,
but only in a very specific sense.


>Look at the paragraph that precedes this one:
>     In cases and for pieces where an encoding other than UTF-8 is used,
>     and for raw binary data encoded in URIs (see [RFC2397]), the octets
>     have to be %-escaped.  In these situations, the ability of IRIs to
>     directly represent a wide character repertoire cannot be used.
>How do you know the encoding of the URI? How can you tell if it is UTF-8 
>(and therefore convertible to an IRI) or something else?
>
>Asked another way, if I'm writing an IRI converter, how do I know that 
>this is OK:
>    http://www.example.org/r%C3%A9sum%C3%A9.html
>But this isn't:
>    http://www.example.org/r%E9sum%E9.html

A converter always converts from one thing to another. I guess you mean
a URI-to-IRI converter, and you are asking whether the examples above
can be converted to http://www.example.org/r<eacute>sum<eacute>.html.


>Is it simply because the second one fails a UTF8-decode test? What about 
>characters from other encodings that have values that are the same as 
>valid UTF8 values?

Section 1.2 is just a general statement of applicability. It doesn't
tell you how to convert from URIs to IRIs. That's done in Section 3.2.

What section 1.2 says is just that in order to be able to use an IRI
in a specific case, the characters on the server that one would want
to directly expose in the IRI actually have to be exposed via UTF-8
in the corresponding URI. In other words, if I have a file named
"r<eacute>sum<eacute>.html" on my server, I can only put
"r<eacute>sum<eacute>.html" into the IRI if "r<eacute>sum<eacute>.html"
is actually exposed as r%C3%A9sum%C3%A9.html in the corresponding
URI.
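To make the two escaped forms concrete, here is a minimal sketch in Python of how the same file name ends up as different %-escapes depending on the encoding used before escaping. It relies only on the standard urllib.parse.quote function; the variable names are illustrative and not taken from the draft.

```python
from urllib.parse import quote

# The file name from the example, with e-acute written as U+00E9.
name = "r\u00e9sum\u00e9.html"

# Encode to octets as UTF-8, then %-escape the non-ASCII octets.
utf8_escaped = quote(name, encoding="utf-8")

# Same file name, but encoded as iso-8859-1 before escaping.
latin1_escaped = quote(name, encoding="iso-8859-1")

print(utf8_escaped)    # r%C3%A9sum%C3%A9.html
print(latin1_escaped)  # r%E9sum%E9.html
```

Only the first form can be shown as "r<eacute>sum<eacute>.html" in an IRI; the second has to stay %-escaped.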

Section 3.1 indeed says that if you have
http://www.example.org/r%E9sum%E9.html, and you try to convert it to
an IRI, you still end up with http://www.example.org/r%E9sum%E9.html.
If you have http://www.example.org/r%C3%A9sum%C3%A9.html, and try
to convert it to an IRI, you will end up with
http://www.example.org/r<eacute>sum<eacute>.html. Because of the
regularity of UTF-8, there is a high chance that this corresponds
to the actual characters on the server side, but this is not guaranteed.
In the same way, you have no guarantee that http://www.example.org/A
actually corresponds to an "A" on the server side. The only guarantee
you have is that the "A" stands for octet %41, and that the server
will know what that octet corresponds to.
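The behaviour described above can be sketched roughly as follows, in Python. This is a simplified illustration, not the procedure from the draft: the function name uri_to_iri and the helper are hypothetical, it only looks at runs of %-escapes, and it leaves a whole run escaped if any part of it is not valid UTF-8 or decodes to ASCII. A real converter would need to follow the conversion section of the draft in detail.

```python
import re

# A maximal run of consecutive %HH escapes, e.g. "%C3%A9".
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match):
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. a lone %E9): keep the escapes as they are.
        return run
    # Simplification: keep runs containing ASCII escaped, so that
    # escapes such as %41 or %2F are never turned into literal characters.
    if any(ord(ch) < 0x80 for ch in text):
        return run
    return text


def uri_to_iri(uri):
    """Expose %-escaped UTF-8 sequences as characters; leave the rest."""
    return _ESCAPE_RUN.sub(_decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))

# The same two octets are also legal in iso-8859-1, where they mean
# A-tilde followed by a copyright sign. This is why the result is
# likely, but not guaranteed, to match what the server intended.
other_reading = bytes.fromhex("C3A9").decode("iso-8859-1")
```

The first call yields the IRI with two e-acute characters, while the second returns its input unchanged, because %E9 on its own is not a valid UTF-8 sequence.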

Does that explain things better?

Regards,    Martin.
Received on Wednesday, 16 April 2003 15:09:03 UTC