Re: a few URI/href issues captured with test cases from Martin J. Dürst on 2009-05-22 (public-iri@w3.org from May 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 22 May 2009 16:41:24 +0900
To: Julian Reschke <julian.reschke@gmx.de>
CC: Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "www-tag@w3.org" <www-tag@w3.org>, public-iri@w3.org
Message-ID: <4A165724.9080208@it.aoyama.ac.jp>

cc'ing public-iri@w3.org

On 2009/05/22 2:44, Julian Reschke wrote:
> Anne van Kesteren wrote:

>> Just for the record, around the same time I pointed out that this
>> could not work because of Step 1b in section 3.1 of RFC 3987. This may
>> or may not be a bug in RFC 3987, but it is most definitely an issue.
>
> I apologize that I keep forgetting this issue; for the record it is this
> one
>
> b. If the IRI is in some digital representation (e.g., an
> octet stream) in some known non-Unicode character
> encoding, convert the IRI to a sequence of characters
> from the UCS normalized according to NFC.
>
> -- <http://tools.ietf.org/html/rfc3987#section-3.1>
>
> ...which is weird, because the normalization is only enforced on
> non-Unicode encodings. Seems this needs to be discussed in the context
> of IRIbis.

There is an issue for this:
http://www.w3.org/International/iri-edit/#transcodeNFC-103

The original reason in RFC 3987 for enforcing normalization only for
non-Unicode encodings is that you want to have a way to include
non-normalized data in an IRI, in particular in the query part,
where it may be needed, e.g., as input to a normalization Web service.
On the other hand, there are some non-Unicode (=legacy) encodings that 
by construction are not normalized; to increase the chance that users 
get what they see, normalization seemed appropriate. RFC 3987 contains 
some Vietnamese examples, please have a look at them.
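
To make the effect of Step 1b concrete, here is a minimal sketch in Python.
It is not taken from RFC 3987 or from any implementation; the function name
to_iri_chars and the "codec name starts with utf-" test for a Unicode
encoding are assumptions made purely for illustration. The sample bytes
spell "Việt" in windows-1258 (Python codec cp1258), where the dot below is
stored as a separate combining character.

```python
import codecs
import unicodedata
from urllib.parse import quote


def to_iri_chars(raw: bytes, encoding: str) -> str:
    """Decode raw bytes; apply NFC only for non-Unicode (legacy) encodings.

    Hypothetical helper: treating every codec whose canonical name starts
    with "utf-" as a Unicode encoding is a simplification for this sketch.
    """
    text = raw.decode(encoding)
    is_unicode_encoding = codecs.lookup(encoding).name.startswith("utf-")
    if not is_unicode_encoding:
        text = unicodedata.normalize("NFC", text)
    return text


# "Việt" in windows-1258: 0xEA is U+00EA, 0xF2 is U+0323 (combining dot below)
legacy_bytes = b"Vi\xea\xf2t"

print(quote(legacy_bytes.decode("cp1258")))        # Vi%C3%AA%CC%A3t (no NFC)
print(quote(to_iri_chars(legacy_bytes, "cp1258")))  # Vi%E1%BB%87t (NFC applied)

# Data that arrives in a Unicode encoding is passed through unchanged,
# so non-normalized input can still be carried, e.g. in a query part.
utf8_bytes = "Vi\u00ea\u0323t".encode("utf-8")
print(quote(to_iri_chars(utf8_bytes, "utf-8")))     # Vi%C3%AA%CC%A3t
```

The two percent-encoded results differ, which is the interoperability
concern: whether a transcoder normalizes decides which URI is produced.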

The above Step 1b (which has a MUST inherited from context) has been
mentioned repeatedly as problematic. Several reasons have been given:

- Normalization would have to occur at a stage when the document is 
already in some Unicode encoding internally, and so the original 
encoding may not be known anymore (this came from the CSS WG, as far as 
I remember). Abstractly, this may be true, but then see below.

- Current implementations don't do this, so it shouldn't be specified 
this way. Also, implementing it may be rather heavy. (Björn Höhrmann,
http://lists.w3.org/Archives/Public/public-iri/2005Jun/0000.html)
This is true as far as I have tested (in particular for several 
browsers/browser versions on Windows). On the other hand, I know that
libiconv uses a normalizing transcoder for windows-1258, which means
that if any browser or other implementation uses libiconv, e.g. on Unix,
then that implementation would do what RFC 3987 requires, at least for
the case where it matters most (windows-1258 is used for Vietnamese).

- Parts of the statement (e.g. "non-Unicode character encoding") are not
defined as precisely as they might be (also Björn Höhrmann, same mail).
I guess that's the smaller part of the problem; if we know what we want 
to specify overall, I'm sure we can work out the details.

I have proposed to replace the MUST with a SHOULD for Step 1b (see
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0008.html and
http://tools.ietf.org/html/draft-duerst-iri-bis-05#section-3.1).

For this, I got a reply from Björn (see
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0009.html), but 
while that mail said that a SHOULD was wrong, Björn didn't say what 
would be right, in his opinion. From 
http://lists.w3.org/Archives/Public/public-iri/2005Jun/0000.html, I'd 
tend to conclude that Björn wanted no normalization, but 
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0009.html seems 
to push back for a MUST.

Any help, either in interpreting Björn's mail or on the issue as such, 
is appreciated greatly.

>>> 1) non-IRI characters found in the query part are encoded using the
>>> document's character encoding, then percent-escaped (*)

It seems to me that once the document is transcoded to an internal form,
the original encoding may no longer be known. Conversely, if browsers
can implement this, they should also have no great difficulty knowing
whether or not to apply NFC for IRIs.
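
As a rough illustration of the behavior quoted in 1) above, the following
Python sketch percent-escapes the non-ASCII characters of a query string
using a given document encoding. The helper name escape_query and the
choice of "=" and "&" as the only extra safe characters are assumptions
for this example, not something specified in the quoted text.

```python
from urllib.parse import quote


def escape_query(query: str, document_encoding: str) -> str:
    """Encode non-ASCII characters with the document's encoding, then
    percent-escape the resulting bytes (hypothetical helper)."""
    return quote(query, safe="=&", encoding=document_encoding)


# The same query yields different URIs depending on the document encoding,
# so the original encoding has to remain known at this stage.
print(escape_query("q=caf\u00e9", "iso-8859-1"))  # q=caf%E9
print(escape_query("q=caf\u00e9", "utf-8"))       # q=caf%C3%A9
```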

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Friday, 22 May 2009 07:52:22 UTC