RE: [CSS21] uri() from Martin Duerst on 2005-05-04 (public-i18n-core@w3.org from April to June 2005)

From: Martin Duerst <duerst@it.aoyama.ac.jp>
Date: Wed, 04 May 2005 10:27:38 +0900
To: Ian Hickson <ian@hixie.ch>, Addison Phillips <addison.phillips@quest.com>
Cc: Richard Ishida <ishida@w3.org>, www-style@w3.org, public-i18n-core@w3.org
Message-Id: <6.0.0.20.2.20050504100844.06797c00@itmail.it.aoyama.ac.jp>

Hello Ian, others,

At 06:54 05/05/04, Ian Hickson wrote:
 >
 >On Tue, 3 May 2005, Addison Phillips wrote:

I agree with what Addison wrote.

 >> I believe that you are confusing two different issues here.
 >
 >It's possible, but I don't think so.
 >
 >
 >> IDN imposes particular restrictions on what can be registered in a
 >> domain name

[Which is irrelevant for CSS.]

 >>(and what user-agents must do when requesting content from a
 >> non-ASCII domain name in order to encode the characters for DNS).

Yes.

 >It also says how to encode a domain name given a particular string of
 >unicode codepoints, and requires (indirectly) that you use NFKC.

Yes.

 >> IRI, by contrast, describes how to use non-ASCII characters in URIs.
 >> While domain names are certainly a part of a URI, their handling in IRI
 >> has nothing to do with Stringprep or Punycode. Implementations of one
 >> are not coupled to implementations of the other necessarily.
 >
 >Correct. IRIs also says how to encode a particular string of unicode
 >codepoints, and in certain contexts requires that you use NFC.
 >
 >It is quite possible to implement this (using NFKC for the domain name and
 >NFC for the path), my point here was just that it required that UAs have
 >both the NFKC and NFC tables to do it, which seems excessive given the
 >very limited resources some UAs may have (especially on small devices).

I agree that having the tables can easily be a burden on small devices.
But I don't think that needing both NFKC and NFC is such a big problem:
the decomposition tables needed for NFC are a subset of those for NFKC,
and the composition tables are the same.
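
To illustrate the point about shared tables, here is a minimal sketch
using Python's standard unicodedata module (my choice for illustration;
the helper name decomposition_kind is hypothetical). Canonical mappings
are applied by both NFC and NFKC, while compatibility mappings (tagged
with "<...>" in the Unicode data) are applied by NFKC only.

```python
import unicodedata


def decomposition_kind(ch):
    """Classify the decomposition mapping of a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as "<compat>" or "<wide>".
    return "compatibility" if mapping.startswith("<") else "canonical"


E_ACUTE = "\u00e9"      # precomposed e with acute accent
FI_LIGATURE = "\ufb01"  # latin small ligature fi

# A canonical mapping: both forms compose "e" + combining acute the same way.
nfc_composed = unicodedata.normalize("NFC", "e\u0301")
nfkc_composed = unicodedata.normalize("NFKC", "e\u0301")

# A compatibility mapping: only NFKC replaces the ligature.
nfc_ligature = unicodedata.normalize("NFC", FI_LIGATURE)
nfkc_ligature = unicodedata.normalize("NFKC", FI_LIGATURE)
```

An implementation that already carries the NFKC data for IDN therefore
needs, for NFC, only to ignore the tagged (compatibility) entries.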

 >> Items 1b and 1c in Section 3 of IRI basically say that URIs should use
 >> the UTF-8 encoding for percent escaping non-ASCII characters (making
 >> them reliable, whereas today they might NOT be encoded using UTF-8,
 >> causing interoperability woes). This requires no special knowledge of
 >> Unicode, merely conversion between the native encoding and Unicode (when
 >> a native, legacy encoding is used for the stylesheet). Since the
 >> character set for HTML and XML is Unicode and since Unicode holds at
 >> least a special place in CSS, this doesn't strike me as an insuperable
 >> burden.
 >
 >The problem is the _difference_ between points 1b and 1c, which, in
 >several implementations of CSS, would currently be impossible to implement
 >in any sensible manner.
 >
 >For an example of the difference, see:
 >
 >   http://lists.w3.org/Archives/Public/www-style/2005Mar/0102

This example is well worked out, but quite theoretical.
The main encoding/language combination where the difference between
1b and 1c really matters is windows-1258 with Vietnamese. If you want
that case to work for the user, you have to look at normalization,
one way or another.
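
A rough sketch of the Vietnamese case, assuming Python's cp1258 codec as
a stand-in for windows-1258 (the function name
percent_encode_from_legacy is hypothetical). The legacy encoding stores
some accented letters as a base letter plus a combining mark, so a plain
conversion to Unicode gives text that is not in NFC, and the
percent-escaped result differs depending on whether normalization is
applied.

```python
import unicodedata
from urllib.parse import quote


def percent_encode_from_legacy(raw, encoding, normalize=True):
    """Decode legacy bytes, optionally apply NFC, then escape as UTF-8."""
    text = raw.decode(encoding)
    if normalize:
        text = unicodedata.normalize("NFC", text)
    # quote() percent-escapes the UTF-8 bytes of a str by default.
    return quote(text, safe="/")


# "e with circumflex" followed by a combining acute accent, as it would
# be stored in a windows-1258 stylesheet (two bytes, not one character).
raw = "\u00ea\u0301".encode("cp1258")

with_nfc = percent_encode_from_legacy(raw, "cp1258")
without_nfc = percent_encode_from_legacy(raw, "cp1258", normalize=False)
```

Here with_nfc is "%E1%BA%BF" (the single precomposed character U+1EBF),
while without_nfc is "%C3%AA%CC%81" (two characters). A server that
stores the resource under the precomposed name would only match the
first form.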

 >This would be nigh-on-impossible for UAs to sanely implement because it
 >would require character encoding information to be propagated through the
 >implementation into parts of the code that are completely unrelated to the
 >parsing of the original document (e.g. the DOM code).

The original character encoding is part of the Infoset. See point 6 at
http://www.w3.org/TR/2004/REC-xml-infoset-20040204/#infoitem.document.
Given this, I don't think that claims like "nigh-on-impossible" are
justified. Getting at that info may be a bit of a hack in some
implementations, but "nigh-on-impossible" it is not.

 >My personal guess would be that when IRIs are formally added to CSS
 >(probably in CSS3's Values and Units module), the specification will state
 >that regardless of the stylesheet's original encoding, the IRIs should
 >consider themselves to be in a Unicode environment. However, this would
 >technically be in violation of the IRI RFC.

Well, the IRI RFC, as well as the IDN-related RFCs, are still in Draft
Stage. Feedback on what parts are easy or difficult to implement is
definitely welcome. I can imagine that a future version of these
specs would contain, for example, some provision for small devices
to skip the normalization/nameprep, if there is enough feedback in
this direction.

But such feedback should not be used to throw out the baby with the
bathwater. Making it clear that CSS interprets IRIs using UTF-8, rather
than the encoding of the document, is a very high priority.
(Implementations that use the document encoding still exist, although
you claim that getting at that encoding is "nigh-on-impossible".)
Getting the normalization stuff right is quite desirable, too,
but less of a priority, because the number of URIs/IRIs affected
is far smaller.
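
As a small illustration of why the UTF-8 point matters more, here is a
sketch using Python's urllib.parse.quote (the helper escape_path and the
sample path are made up for the example). The same reference yields
different percent-escapes depending on which encoding is used, so a
stylesheet would resolve differently merely by being re-saved in another
encoding.

```python
from urllib.parse import quote


def escape_path(path, encoding="utf-8"):
    """Percent-escape the non-ASCII characters of a path in `encoding`."""
    return quote(path, safe="/", encoding=encoding)


path = "/caf\u00e9.png"  # hypothetical path containing e with acute

utf8_form = escape_path(path)                   # UTF-8, as IRI asks for
latin1_form = escape_path(path, "iso-8859-1")   # document-encoding style
```

With UTF-8 the result is "/caf%C3%A9.png"; with iso-8859-1 it is
"/caf%E9.png". Only the first is independent of how the stylesheet
happens to be stored.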


Regards,     Martin.
Received on Saturday, 7 May 2005 06:07:09 UTC