
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Richard Ishida <ishida@w3.org>
Date: Wed, 4 Feb 2009 13:49:19 -0000
To: <public-i18n-core@w3.org>, "'W3C Style List'" <www-style@w3.org>
Message-ID: <007801c986cf$60db8040$229280c0$@org>

I'd like to see if I can summarise this discussion so far, in very high-level terms.  I hope this will not appear too simplistic.

It was suggested by people involved in i18n that, to support the growth of Web technologies in the developing world, CSS should treat names in selectors and class names as matching during lookup when they are canonically equivalent but composed of different sequences of Unicode characters. The same may apply to other matching operations. This was proposed as particularly useful for non-Western languages, especially when different people are involved in developing the style sheet and the markup, because different editing tools output canonically equivalent text in different ways.
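To make the matching problem concrete (a hypothetical illustration, not an example from the thread; the class name "café" is invented), the same visible name can arrive as two different code point sequences, and a code-point-for-code-point comparison then fails:

```python
import unicodedata

# The class name "café" can be encoded in two canonically equivalent ways:
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT

# A plain string comparison, as selector matching does today,
# treats them as different names:
print(composed == decomposed)   # False

# Normalizing both sides to the same form makes the match succeed:
print(composed == unicodedata.normalize("NFC", decomposed))   # True
```

The two sequences render identically in most editors, which is why authors mixing tools may never notice the mismatch until a selector silently fails.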

Implementers are concerned about the impact on performance of normalizing before comparing such strings.

Implementers asked whether this is a real issue.  Several people responded with examples to say they believe it is.

Some discussion centred around when the normalization would be done.  Normalization while a file is parsed (i.e. 'early' normalization during encoding conversion, etc.) appears to impact performance much less than normalization on the fly. There are concerns, however, that normalizing content, rather than just code, would be inappropriate. The user agent would also need to normalize the markup as well as the style sheet, so that the two match; this makes it a bigger issue than just CSS.  There was an additional question about the legality of normalizing XML markup for internal representation.  Further, some thought will need to be given to how this affects dynamic manipulation of code, e.g. using JavaScript.
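A minimal sketch of the 'early' approach, assuming NFC as the target form (the function name and the choice of form are mine, not the thread's): normalization happens once, immediately after encoding conversion, so that every later comparison is plain string equality. The point about markup is visible here too, since both inputs must pass through the same step:

```python
import unicodedata

def decode_and_normalize(raw: bytes, encoding: str = "utf-8") -> str:
    """'Early' normalization: normalize once, right after decoding,
    so all later name comparisons are simple string equality."""
    return unicodedata.normalize("NFC", raw.decode(encoding))

# Both the style sheet and the markup must go through the same step,
# or a selector in one still fails to match a name in the other:
css_name = decode_and_normalize("cafe\u0301".encode("utf-8"))    # decomposed input
html_name = decode_and_normalize("caf\u00e9".encode("utf-8"))    # composed input
print(css_name == html_name)   # True
```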

It was also said that matching canonically equivalent text is required by Unicode compliant implementations anyway, independent of the needs of users, though other quarters seem to dispute that.

Some people have proposed that the problem is best fixed by just ensuring that editing tools output normalized code, so that the user agent doesn't need to normalize.  Others have said that this can't be mandated or controlled, and in practice is not happening, so the problem needs to be addressed in the user agents. 

There were some spin-off discussions about which normalization form is best, and the differences between UTF-8 and UTF-16.
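For concreteness (a small illustration of my own, using "café" as the sample string): NFC yields the composed form and NFD the decomposed form, and the two differ in code point count and in encoded length under both UTF-8 and UTF-16.

```python
import unicodedata

s = "caf\u00e9"
nfc = unicodedata.normalize("NFC", s)   # composed: 4 code points
nfd = unicodedata.normalize("NFD", s)   # decomposed: 5 code points

print(len(nfc), len(nfd))                                    # 4 5
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))    # 5 6
print(len(nfc.encode("utf-16-le")), len(nfd.encode("utf-16-le")))  # 8 10
```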

Is that a fair representation?

RI

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/