
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Fri, 30 Jan 2009 00:41:30 +0000
Cc: public-i18n-core@w3.org, www-style@w3.org
Message-Id: <61503A76-E0B5-4E83-B9B3-4F9E4B54A54A@jfkew.plus.com>
To: "L. David Baron" <dbaron@dbaron.org>

On 29 Jan 2009, at 21:55, L. David Baron wrote:

>
> On Thursday 2009-01-29 13:27 -0800, fantasai wrote:
>> Thanks for the tests and the report, Richard. Going from that, I think
>> it makes sense to require /not/ Unicode-normalizing CSS. It may be a
>> bit confusing indeed for people working in Vietnamese and other such
>> languages, but on the other hand behavior across browsers is
>> interoperable right now. If one browser started normalizing, then
>> someone testing in that browser would not notice that the page is
>> broken in other UAs.
>

I'm inclined to disagree. According to
http://www.w3.org/TR/charmod-norm/#C312 (yes, I realize this is still a
draft document):

<quote>
String identity matching must be performed as if the following steps
were followed:
 1. Early uniform normalization to fully-normalized form, as defined in
    3.2.4 Fully-normalized text. In accordance with section 3
    Normalization, this step must be performed by the producers of the
    strings to be compared.
 2. Conversion to a common Unicode encoding form, if necessary.
 3. Expansion of all recognized character escapes and includes.
 4. Testing for bit-by-bit identity.
</quote>
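
As a rough illustration of the last three steps in that procedure, here is a minimal Python sketch. The names identity_match and expand_escapes are hypothetical, and the escape syntax handled (a backslash plus 1-6 hex digits, CSS-style) is an assumption chosen for the example. The normalization step is deliberately left out, since the draft assigns it to the producers of the strings.

```python
import re

# Assumption: CSS-style escapes, i.e. a backslash, 1-6 hex digits, and
# one optional trailing whitespace character. Other escape forms and
# "includes" are not handled in this sketch.
_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\r\f]?")


def expand_escapes(text: str) -> str:
    """Replace each hex escape with the code point it names.

    chr() raises ValueError for values above 0x10FFFF; a real parser
    would need a policy for that case.
    """
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def identity_match(a: bytes, a_encoding: str, b: bytes, b_encoding: str) -> bool:
    # Conversion to a common form: decode both inputs to Python str.
    text_a = a.decode(a_encoding)
    text_b = b.decode(b_encoding)
    # Expansion of escapes, then exact code-point-by-code-point comparison.
    # No normalization happens here: that is the producers' job.
    return expand_escapes(text_a) == expand_escapes(text_b)
```

With this sketch, the UTF-8 bytes for "caf\E9" match the UTF-16 bytes for "café", but a decomposed "cafe" + U+0301 does not match the precomposed form. That is exactly the gap left open when producers do not normalize.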

See also the Unicode standard itself, for example conformance clause  
C6 (pp 71-72):

<quote>
C6 A process shall not assume that the interpretations of two
canonical-equivalent character sequences are distinct.
 The implications of this conformance clause are twofold. First, a
process is never required to give different interpretations to two
different, but canonical-equivalent character sequences. Second, no
process can assume that another process will make a distinction between
two different, but canonical-equivalent character sequences.
</quote>

The view that two canonically-equivalent sequences of Unicode  
codepoints are interchangeable representations of the same textual  
data, and should behave as such, is a pretty fundamental aspect of the  
encoding standard. I don't think users will be well served by a  
situation where apparently-identical strings do not necessarily match.

> In what parts of CSS might you want unicode normalization to be
> done?  The only case I can think of is selector matching that
> compares attribute values (or, in the future, text content). And
> even then it seems like it might be helpful in some cases and
> harmful in others.
>
> (I tend to think we probably don't want it for selectors, both since
> we currently have interoperability, and since selectors are
> particularly performance-sensitive.)

"Interoperability" based on non-normalization is an illusion: it may
work much of the time, but it will fail in apparently mysterious ways.
Suppose you have an HTML page and an associated CSS file with
non-normalized selectors, and browsers don't normalize when matching.
It works fine. Later, someone edits the CSS with a text editor that
automatically normalizes during file open/save (which it would be fully
entitled to do, according to the Unicode standard). You can't see that
anything has changed, but suddenly the CSS and the HTML have different
forms for the "same" selector, and it breaks. Or suppose the HTML passes
through a process (perhaps a content filter of some kind) that applies
normalization, but passes the CSS through unchanged because it doesn't
care about inspecting it.
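
The editor scenario can be simulated in a few lines of Python. This is only a sketch: the class name is made up, naive_match and normalizing_match are hypothetical stand-ins for a browser's selector comparison, and the "editor" is modelled as a plain NFC normalization.

```python
import unicodedata


def naive_match(selector_name: str, attribute_value: str) -> bool:
    # What a non-normalizing browser does: exact code point comparison.
    return selector_name == attribute_value


def normalizing_match(selector_name: str, attribute_value: str) -> bool:
    # Compare after bringing both sides to NFC.
    return (unicodedata.normalize("NFC", selector_name)
            == unicodedata.normalize("NFC", attribute_value))


# Both files start out in the same decomposed (NFD) form.
html_class = unicodedata.normalize("NFD", "Vi\u1ec7t")
css_class = unicodedata.normalize("NFD", "Vi\u1ec7t")
before = naive_match(css_class, html_class)

# An editor silently saves the CSS in NFC; the HTML is untouched.
css_class_saved = unicodedata.normalize("NFC", css_class)
after = naive_match(css_class_saved, html_class)
after_normalizing = normalizing_match(css_class_saved, html_class)

print(before, after, after_normalizing)
```

The two strings render identically before and after the save, yet only the normalizing comparison still matches afterwards.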

I doubt performance would be a problem, as in practice selectors will  
virtually always be in NFC (most often simple ASCII), and you can  
verify this very cheaply at the same time as doing a "naive" string  
equality test; then you'd only have to fall back on a more expensive  
path in extremely rare cases.
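
A minimal Python sketch of that fast path follows. The function names are hypothetical. It relies on the property that a string containing only code points below U+0300 is already in NFC, so two such strings that differ cannot be canonically equivalent. For simplicity the check here is a separate scan; an implementation could fold it into the comparison pass itself.

```python
import unicodedata

# Code points below U+0300 never need NFC normalization.
FIRST_POSSIBLY_UNNORMALIZED = "\u0300"


def trivially_nfc(s: str) -> bool:
    """Cheap test: True if every code point is below U+0300."""
    return all(ch < FIRST_POSSIBLY_UNNORMALIZED for ch in s)


def selectors_match(a: str, b: str) -> bool:
    # Identical strings are always equivalent.
    if a == b:
        return True
    # Common case (ASCII, Latin-1, ...): both sides are known to be NFC,
    # so a mismatch is final and no normalization is needed.
    if trivially_nfc(a) and trivially_nfc(b):
        return False
    # Rare case: fall back on the more expensive comparison.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Note that the scan has to cover both strings in full before a mismatch can be treated as final: "A" + U+030A and U+00C5 already differ at the first code point, yet they are canonically equivalent.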

(Maybe it's time for a test implementation and some performance  
figures....)

JK
Received on Friday, 30 January 2009 00:42:20 GMT
