RE: [CSS21] uri() from Ian Hickson on 2005-05-03 (www-style@w3.org from May 2005)

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 3 May 2005 21:54:28 +0000 (UTC)
To: Addison Phillips <addison.phillips@quest.com>
Cc: Richard Ishida <ishida@w3.org>, www-style@w3.org, public-i18n-core@w3.org
Message-ID: <Pine.LNX.4.61.0505032145130.5094@dhalsim.dreamhost.com>

On Tue, 3 May 2005, Addison Phillips wrote:
> 
> I believe that you are confusing two different issues here.

It's possible, but I don't think so.

> IDN imposes particular restrictions on what can be registered in a 
> domain name (and what user-agents must do when requesting content from a 
> non-ASCII domain name in order to encode the characters for DNS).

It also says how to encode a domain name given a particular string of 
Unicode code points, and requires (indirectly) that you use NFKC.

> IRI, by contrast, describes how to use non-ASCII characters in URIs. 
> While domain names are certainly a part of a URI, their handling in IRI 
> has nothing to do with Stringprep or Punycode. Implementations of one 
> are not coupled to implementations of the other necessarily.

Correct. The IRI spec also says how to encode a particular string of 
Unicode code points, and in certain contexts requires that you use NFC.

It is quite possible to implement this (using NFKC for the domain name and 
NFC for the path). My point here was just that it requires UAs to have 
both the NFKC and NFC tables to do it, which seems excessive given the 
very limited resources some UAs may have (especially on small devices).
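
To make the two-table point concrete, here is a rough sketch in Python. It 
covers only the normalization step, not the rest of Nameprep or Punycode, 
and the function names are made up for illustration:

```python
import unicodedata

def normalize_host(host: str) -> str:
    # Hypothetical helper: IDN (via Nameprep) ends up requiring NFKC,
    # which also folds compatibility characters such as ligatures.
    return unicodedata.normalize("NFKC", host)

def normalize_path(path: str) -> str:
    # Hypothetical helper: IRI processing uses NFC in certain contexts,
    # which composes characters but leaves compatibility forms alone.
    return unicodedata.normalize("NFC", path)

# U+FB01 (LATIN SMALL LIGATURE FI) is treated differently by the two forms.
sample = "\ufb01le"
print(ascii(normalize_host(sample)))  # ligature folded to "fi"
print(ascii(normalize_path(sample)))  # ligature preserved
```

The same input string comes out differently depending on which part of the 
address it sits in, so a UA needs the data for both forms.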

> Items 1b and 1c in Section 3 of IRI basically say that URIs should use 
> the UTF-8 encoding for percent escaping non-ASCII characters (making 
> them reliable, whereas today they might NOT be encoded using UTF-8, 
> causing interoperability woes). This requires no special knowledge of 
> Unicode, merely conversion between the native encoding and Unicode (when 
> a native, legacy encoding is used for the stylesheet). Since the 
> character set for HTML and XML is Unicode and since Unicode holds at 
> least a special place in CSS, this doesn't strike me as an insuperable 
> burden.

The problem is the _difference_ between points 1b and 1c, which, in 
several implementations of CSS, would currently be impossible to implement 
in any sensible manner.

For an example of the difference, see:

   http://lists.w3.org/Archives/Public/www-style/2005Mar/0102

Honoring that difference would be nigh-on-impossible for UAs to sanely 
implement because it would require character encoding information to be 
propagated through the implementation into parts of the code that are 
completely unrelated to the parsing of the original document (e.g. the DOM 
code).
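
A rough Python sketch of why the encoding has to travel with the string. 
As I read section 3.1 of the IRI RFC, step 1b applies NFC when the IRI came 
from a non-Unicode encoding, and step 1c says not to normalize when it was 
already in a Unicode encoding. The function name, the encoding labels, and 
the simplified set of "Unicode encodings" below are assumptions made for 
illustration only:

```python
import unicodedata
from urllib.parse import quote

# Simplified assumption: treat only these labels as Unicode-based encodings.
UNICODE_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

def iri_path_to_uri(path: str, source_encoding: str) -> str:
    """Hypothetical helper: convert an IRI path to a URI path.

    The result depends on the encoding of the document the string came
    from, even though `path` is already a sequence of code points.
    """
    if source_encoding.lower() not in UNICODE_ENCODINGS:
        # Step 1b: legacy encoding, so normalize to NFC first.
        path = unicodedata.normalize("NFC", path)
    # Step 1c: Unicode encoding, so leave the code points untouched.
    # Then percent-encode the non-ASCII characters as UTF-8.
    return quote(path, safe="/")

# Same code points ("e" followed by U+0301), e.g. written with CSS escapes.
decomposed = "/cafe\u0301"
print(iri_path_to_uri(decomposed, "utf-8"))
print(iri_path_to_uri(decomposed, "windows-1252"))
```

The two calls produce different URIs from identical code points, so any 
code that resolves the address later (the DOM, say) would need to be told 
which encoding the stylesheet originally used.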

My personal guess would be that when IRIs are formally added to CSS 
(probably in CSS3's Values and Units module), the specification will state 
that regardless of the stylesheet's original encoding, the IRIs should 
consider themselves to be in a Unicode environment. However, this would 
technically be in violation of the IRI RFC.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Tuesday, 3 May 2005 21:54:49 UTC