
RE: [CSS21] uri()

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 04 May 2005 10:27:38 +0900
To: Ian Hickson <ian@hixie.ch>, Addison Phillips <addison.phillips@quest.com>
Cc: Richard Ishida <ishida@w3.org>, www-style@w3.org, public-i18n-core@w3.org

Hello Ian, others,

At 06:54 05/05/04, Ian Hickson wrote:
 >On Tue, 3 May 2005, Addison Phillips wrote:

I agree with what Addison wrote.

 >> I believe that you are confusing two different issues here.
 >It's possible, but I don't think so.
 >> IDN imposes particular restrictions on what can be registered in a
 >> domain name

[Which is irrelevant for CSS.]

 >>(and what user-agents must do when requesting content from a
 >> non-ASCII domain name in order to encode the characters for DNS).


 >It also says how to encode a domain name given a particular string of
 >unicode codepoints, and requires (indirectly) that you use NFKC.


 >> IRI, by contrast, describes how to use non-ASCII characters in URIs.
 >> While domain names are certainly a part of a URI, their handling in IRI
 >> has nothing to do with Stringprep or Punycode. Implementations of one
 >> are not coupled to implementations of the other necessarily.
 >Correct. IRIs also says how to encode a particular string of unicode
 >codepoints, and in certain contexts requires that you use NFC.
 >It is quite possible to implement this (using NFKC for the domain name and
 >NFC for the path), my point here was just that it required that UAs have
 >both the NFKC and NFC tables to do it, which seems excessive given the
 >very limited resources some UAs may have (especially on small devices).

I think that having the tables can easily create a burden on small devices.
I don't think that the fact that both NFKC and NFC are needed is such a
big problem; the decomposition tables needed for NFC are a subset of those
for NFKC, and the composition tables are the same.
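
To make the table-size point concrete, here is a minimal sketch in Python.
It assumes the standard unicodedata module as a stand-in for a UA's own
tables; the function name decomposition_counts is hypothetical. Canonical
mappings are the ones NFC needs. NFKC needs those plus the compatibility
mappings, which unicodedata marks with a leading "<tag>".

```python
import sys
import unicodedata


def decomposition_counts():
    """Count canonical vs. compatibility decomposition mappings.

    NFC only uses the canonical mappings; NFKC uses both kinds, so the
    NFC data is a subset of the NFKC data. (Hangul syllables decompose
    algorithmically and do not appear in this table.)
    """
    canonical = 0
    compatibility = 0
    for cp in range(sys.maxunicode + 1):
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping:
            continue
        if mapping.startswith("<"):
            compatibility += 1  # e.g. "<compat> 0066 0069" for U+FB01
        else:
            canonical += 1      # e.g. "0065 0301" for U+00E9
    return canonical, compatibility


if __name__ == "__main__":
    canonical, compatibility = decomposition_counts()
    print("canonical mappings (NFC and NFKC):", canonical)
    print("compatibility mappings (NFKC only):", compatibility)

    # Composition behaves the same way in both forms.
    assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
    assert unicodedata.normalize("NFKC", "e\u0301") == "\u00e9"

    # Only NFKC applies the compatibility mapping for the "fi" ligature.
    assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
    assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```

The exact counts depend on the Unicode version shipped with the Python
build, so none are quoted here.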

 >> Items 1b and 1c in Section 3 of IRI basically say that URIs should use
 >> the UTF-8 encoding for percent escaping non-ASCII characters (making
 >> them reliable, whereas today they might NOT be encoded using UTF-8,
 >> causing interoperability woes). This requires no special knowledge of
 >> Unicode, merely conversion between the native encoding and Unicode (when
 >> a native, legacy encoding is used for the stylesheet). Since the
 >> character set for HTML and XML is Unicode and since Unicode holds at
 >> least a special place in CSS, this doesn't strike me as an insuperable
 >> burden.
 >The problem is the _difference_ between points 1b and 1c, which, in
 >several implementations of CSS, would currently be impossible to implement
 >in any sensible manner.
 >For an example of the difference, see:
 >   http://lists.w3.org/Archives/Public/www-style/2005Mar/0102

This example is well worked out, but quite theoretical.
The main encoding/language where the difference between 1b and 1c
is really important is windows-1258 and Vietnamese. If you want
that case to work for the user, you have to look at normalization,
one way or another.
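
The Vietnamese case can be sketched as follows. This is an illustration
under assumptions, not text from the spec: it uses Python's cp1258 codec
for windows-1258, the helper name percent_encode_from_legacy is
hypothetical, and the normalize flag stands for my reading of the
difference between 1b (convert from a legacy encoding and normalize to
NFC) and 1c (already Unicode, no normalization). windows-1258 stores
many Vietnamese letters as a base letter plus a combining mark, so the
two paths produce different percent-encoded octets.

```python
import unicodedata
from urllib.parse import quote


def percent_encode_from_legacy(raw: bytes, encoding: str, normalize: bool) -> str:
    """Decode legacy bytes to Unicode, optionally apply NFC, then
    percent-encode the UTF-8 octets (hypothetical helper)."""
    text = raw.decode(encoding)
    if normalize:
        text = unicodedata.normalize("NFC", text)
    return quote(text, safe="")


# In windows-1258, 0xEA is U+00EA (e with circumflex) and 0xEC is
# U+0301 (combining acute accent).
raw = b"\xea\xec"

# With NFC the pair composes to U+1EBF before encoding.
print(percent_encode_from_legacy(raw, "cp1258", normalize=True))   # %E1%BA%BF

# Without normalization the decomposed sequence is encoded as is.
print(percent_encode_from_legacy(raw, "cp1258", normalize=False))  # %C3%AA%CC%81
```

A server that stores the resource under the precomposed name will only
match the first form, which is why normalization matters here.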

 >This would be nigh-on-impossible for UAs to sanely implement because it
 >would require character encoding information to be propagated through the
 >implementation into parts of the code that are completely unrelated to the
 >parsing of the original document (e.g. the DOM code).

The original character encoding is part of the Infoset. See point 6 at
Given this, I don't think that claims like "nigh-on-impossible" are
justified. Getting at that info may be a bit of a hack in some
implementations, but "nigh-on-impossible" it is not.

 >My personal guess would be that when IRIs are formally added to CSS
 >(probably in CSS3's Values and Units module), the specification will state
 >that regardless of the stylesheet's original encoding, the IRIs should
 >consider themselves to be in a Unicode environment. However, this would
 >technically be in violation of the IRI RFC.

Well, the IRI RFC, as well as the IDN-related RFCs, are still in Draft
Stage. Feedback on what parts are easy or difficult to implement is
definitely welcome. I can imagine that a future version of these
specs would e.g. contain some provision for small devices to not
do the normalization/nameprep, if there is enough feedback in this

But such feedback should not be used to throw
out the baby with the bathwater. Making it clear that CSS interprets
IRIs using UTF-8, rather than the encoding of the document (such
implementations still exist, although you claim that getting at
that encoding is "nigh-on-impossible") is a very high priority.
Getting the normalization stuff right is quite desirable, too,
but less of a priority, because the number of URIs/IRIs affected
is way, way smaller.
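
As a small illustration of the high-priority point, the sketch below
percent-encodes the same reference once using UTF-8 and once using the
stylesheet's own encoding. The file name and the choice of iso-8859-1 as
the stylesheet encoding are assumptions made for the example, and
encode_reference is a hypothetical helper.

```python
from urllib.parse import quote


def encode_reference(reference: str, encoding: str = "utf-8") -> str:
    """Percent-encode non-ASCII characters of a reference using the
    given character encoding (hypothetical helper)."""
    return quote(reference, safe="/", encoding=encoding)


reference = "images/caf\u00e9.png"  # as it might appear inside url(...)

# Interpreting the IRI as Unicode and escaping via UTF-8:
print(encode_reference(reference))                # images/caf%C3%A9.png

# Escaping via the encoding of the document instead:
print(encode_reference(reference, "iso-8859-1"))  # images/caf%E9.png
```

The two results name different resources, so the outcome depends on the
stylesheet's encoding unless UTF-8 is always used.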

Regards,     Martin. 
Received on Saturday, 7 May 2005 06:07:09 GMT
