Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

On Wed, 04 Feb 2009 14:49:19 +0100, Richard Ishida <ishida@w3.org> wrote:
> I'd like to see if I can summarise this discussion so far, in very high  
> level terms.  I hope this will not appear too simplistic.

Thanks!


> It was suggested by people involved in i18n that, to support the growth  
> of Web technologies in languages in the developing world, it would be  
> useful for users of CSS if CSS matched names in selectors and class  
> names during lookup that were canonically equivalent but used different  
> sequences of Unicode characters. This may apply to other matching  
> operations too. It was proposed that this would be particularly useful  
> for non-Western languages, especially if different people are involved  
> in developing the style sheet and markup, because different editing  
> tools output canonically equivalent text in different ways.
>
> Implementers are concerned about the impact on performance of  
> normalizing before comparing such strings.
>
> Implementers asked whether this is a real issue.  Several people  
> responded with examples to say they believe it is.

I also asked for research rather than just examples. What has been
demonstrated (though not quite as coherently as I would like) is that
different input systems produce Unicode text in different normalization
forms. What has not been demonstrated is whether this is a problem in
practice. Nor has it been demonstrated that people on the Web actually
use IDs and class names, or create XML element and attribute names, on
which this would have an impact.
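
To make the mismatch under discussion concrete, here is a minimal sketch in
Python, assuming a class name "café" typed once with a precomposed character
and once with a combining accent. The helper names are hypothetical and only
illustrate the comparison; they are not taken from any user agent.

```python
import unicodedata

# Two canonically equivalent spellings of the same (illustrative) class name.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def matches_raw(a: str, b: str) -> bool:
    """Code point comparison with no normalization applied."""
    return a == b


def matches_normalized(a: str, b: str) -> bool:
    """Compare after converting both strings to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(matches_raw(precomposed, decomposed))
    print(matches_normalized(precomposed, decomposed))
```

The two strings render identically, but only the normalizing comparison
treats them as the same name. Whether such pairs actually occur in IDs and
class names on the Web is exactly what the sketch cannot show.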


> Some discussion centred around when the normalization would be done.   
> Normalization while a file is parsed (ie. 'early' normalization during  
> encoding conversion, etc.) would appear to impact performance much less  
> than normalization on the fly. There are concerns, however, that  
> normalizing content, rather than just code, would be inappropriate.  
> Also, the user agent would need to normalize the markup as well as the  
> style sheet, so that both match, so this is a bigger issue than just  
> CSS.  There was an additional question about the legality of normalizing  
> XML markup for internal representation.  Further, some thought will need  
> to be given to how this affects dynamic manipulation of code, eg. using  
> JavaScript.
>
> It was also said that matching canonically equivalent text is required  
> by Unicode compliant implementations anyway, independent of the needs of  
> users, though other quarters seem to dispute that.

I think Martin Dürst quoted text from the Unicode specification to that
effect, so I do not think this is an open question. It is just a
misunderstanding of the requirements Unicode imposes.
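
The early versus on-the-fly distinction quoted above can be sketched roughly
as follows. This is an illustration under assumptions, not a description of
how any user agent works: the function names and the flat list of class
names are hypothetical, and NFC is chosen arbitrarily as the target form.

```python
import unicodedata


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def parse_early(raw_names: list[str]) -> list[str]:
    """'Early' approach: normalize every name once, while parsing."""
    return [nfc(name) for name in raw_names]


def count_matches_early(normalized_names: list[str], normalized_selector: str) -> int:
    """After early normalization, matching is a plain string comparison."""
    return sum(1 for name in normalized_names if name == normalized_selector)


def count_matches_late(raw_names: list[str], raw_selector: str) -> int:
    """'On the fly' approach: normalize each stored name on every lookup."""
    target = nfc(raw_selector)
    return sum(1 for name in raw_names if nfc(name) == target)


if __name__ == "__main__":
    markup_classes = ["cafe\u0301", "caf\u00e9", "tea"]  # illustrative data
    selector = "cafe\u0301"

    early = count_matches_early(parse_early(markup_classes), nfc(selector))
    late = count_matches_late(markup_classes, selector)
    print(early, late)
```

Both approaches find the same matches. The difference is where the cost
lands: once per parsed name in the early case, once per comparison in the
late case. The early variant also changes the stored strings, which is where
the quoted concerns about normalizing content and about script access to the
markup come in.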


> Some people have proposed that the problem is best fixed by just  
> ensuring that editing tools output normalized code, so that the user  
> agent doesn't need to normalize.  Others have said that this can't be  
> mandated or controlled, and in practice is not happening, so the problem  
> needs to be addressed in the user agents.

In practice nothing is happening in user agents either, as far as I can  
tell.


> There were some spin-off discussions about which normalization form is  
> best, and the differences between utf-8 and utf-16.
>
> Is that a fair representation?

Pretty much, with the above remarks.


-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Received on Wednesday, 4 February 2009 15:40:28 UTC