- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 22 May 2009 16:41:24 +0900
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "www-tag@w3.org" <www-tag@w3.org>, public-iri@w3.org
cc'ing public-iri@w3.org

On 2009/05/22 2:44, Julian Reschke wrote:
> Anne van Kesteren wrote:
>> Just for the record, around the same time I pointed out that this
>> could not work because of Step 1b in section 3.1 of RFC 3987. This may
>> or may not be a bug in RFC 3987, but it is most definitely an issue.
>
> I apologize that I keep forgetting this issue; for the record it is this
> one
>
>     b. If the IRI is in some digital representation (e.g., an
>        octet stream) in some known non-Unicode character
>        encoding, convert the IRI to a sequence of characters
>        from the UCS normalized according to NFC.
>
> -- <http://tools.ietf.org/html/rfc3987#section-3.1>
>
> ...which is weird, because the normalization is only enforced on
> non-Unicode encodings. Seems this needs to be discussed in the context
> of IRIbis.

There is an issue for this:
http://www.w3.org/International/iri-edit/#transcodeNFC-103

The original reason in RFC 3987 for enforcing normalization only for
non-Unicode encodings is that you want to have a way to include
non-normalized data in an IRI, in particular in the query part, where
it may be needed, for example, as input to a normalization Web service.
On the other hand, there are some non-Unicode (=legacy) encodings that
by construction are not normalized; to increase the chance that users
get what they see, normalization seemed appropriate. RFC 3987 contains
some Vietnamese examples; please have a look at them.

The above Step 1b (which has a MUST inherited from context) has been
mentioned repeatedly as problematic. Several reasons have been given:

- Normalization would have to occur at a stage when the document is
  already in some Unicode encoding internally, and so the original
  encoding may not be known anymore (this came from the CSS WG, as far
  as I remember).

  Abstractly, this may be true, but then see below.

- Current implementations don't do this, so it shouldn't be specified
  this way. Also, implementing it may be rather heavy. (Björn Höhrmann,
  http://lists.w3.org/Archives/Public/public-iri/2005Jun/0000.html)

  This is true as far as I have tested (in particular for several
  browsers/browser versions on Windows). On the other hand, I know that
  libiconv uses a normalizing transcoder for windows-1258, which means
  that if any browser or other implementation uses libiconv, e.g. on
  Unix, then that implementation would do what RFC 3987 requires, at
  least for the case where it matters most (windows-1258 is used for
  Vietnamese).

- Parts of the statement (e.g. "non-Unicode character encoding") are
  not defined as precisely as they might be (also Björn Höhrmann, same
  mail).

  I guess that's the smaller part of the problem; if we know what we
  want to specify overall, I'm sure we can work out the details.

I have proposed to replace the MUST with a SHOULD for Step 1b (see
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0008.html and
http://tools.ietf.org/html/draft-duerst-iri-bis-05#section-3.1).
For this, I got a reply from Björn (see
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0009.html),
but while that mail said that a SHOULD was wrong, Björn didn't say what
would be right, in his opinion. From
http://lists.w3.org/Archives/Public/public-iri/2005Jun/0000.html,
I'd tend to conclude that Björn wanted no normalization, but
http://lists.w3.org/Archives/Public/public-iri/2007Jul/0009.html
seems to push back for a MUST.

Any help, either in interpreting Björn's mail or on the issue as such,
is greatly appreciated.

>>> 1) non-IRI characters found in the query part are encoded using the
>>> document's character encoding, then percent-escaped (*)

It seems to me that once the document is transcoded to an internal form,
the original encoding may no longer be known. Conversely, if browsers
can implement this, they should also have no great difficulty knowing
whether or not to apply NFC to IRIs.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
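
P.S. As a rough illustration of what Step 1b asks for, here is a
minimal Python sketch. It is not taken from RFC 3987 or from any
implementation; the function name iri_octets_to_characters and the
UNICODE_ENCODINGS set are hypothetical, and treating exactly these
codec names as "Unicode encodings" is an assumption (the RFC does not
define the term precisely). It uses Python's cp1258 codec, which, unlike
the libiconv transcoder mentioned above, does not normalize on its own.

```python
import codecs
import unicodedata

# Assumption: these codec names count as "Unicode encodings" for Step 1b.
UNICODE_ENCODINGS = {
    "utf-8", "utf-16", "utf-16-le", "utf-16-be",
    "utf-32", "utf-32-le", "utf-32-be",
}


def iri_octets_to_characters(octets: bytes, encoding: str) -> str:
    """Decode IRI octets; apply NFC only for non-Unicode encodings."""
    text = octets.decode(encoding)
    # codecs.lookup() maps aliases to one canonical codec name.
    if codecs.lookup(encoding).name in UNICODE_ENCODINGS:
        # Unicode input is left alone, so non-normalized data
        # (e.g. in the query part) can still be passed through.
        return text
    return unicodedata.normalize("NFC", text)


# Vietnamese "e with circumflex and dot below", written the way a
# windows-1258 document stores it: a base letter plus a combining mark.
decomposed = "\u00ea\u0323"
legacy_octets = decomposed.encode("cp1258")

from_legacy = iri_octets_to_characters(legacy_octets, "cp1258")
from_utf8 = iri_octets_to_characters(decomposed.encode("utf-8"), "utf-8")

print(len(from_legacy), len(from_utf8))
```

The legacy-encoded input comes out as the single precomposed character
U+1EC7, while the same characters arriving as UTF-8 are passed through
unchanged. Note that the function needs the original encoding as an
argument, which is exactly the information that may be lost once a
document has been transcoded internally.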
Received on Friday, 22 May 2009 07:52:22 UTC