- From: Jonathan Kew <jonathan@jfkew.plus.com>
- Date: Fri, 30 Jan 2009 00:41:30 +0000
- To: "L. David Baron" <dbaron@dbaron.org>
- Cc: public-i18n-core@w3.org, www-style@w3.org
On 29 Jan 2009, at 21:55, L. David Baron wrote:

> On Thursday 2009-01-29 13:27 -0800, fantasai wrote:
>> Thanks for the tests and the report, Richard. Going from that, I think it
>> makes sense to require /not/ Unicode-normalizing CSS. It may be a bit
>> confusing indeed for people working in Vietnamese and other such languages,
>> but on the other hand behavior across browsers is interoperable right now.
>> If one browser started normalizing, then someone testing in that browser
>> would not notice that the page is broken in other UAs.

I'm inclined to disagree. According to http://www.w3.org/TR/charmod-norm/#C312 (yes, I realize this is still a draft document):

<quote>
String identity matching must be performed as if the following steps were followed:

• Early uniform normalization to fully-normalized form, as defined in 3.2.4 Fully-normalized text. In accordance with section 3 Normalization, this step must be performed by the producers of the strings to be compared.
• Conversion to a common Unicode encoding form, if necessary.
• Expansion of all recognized character escapes and includes.
• Testing for bit-by-bit identity.
</quote>

See also the Unicode standard itself, for example conformance clause C6 (pp 71-72):

<quote>
C6 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.

• The implications of this conformance clause are twofold. First, a process is never required to give different interpretations to two different, but canonical-equivalent character sequences. Second, no process can assume that another process will make a distinction between two different, but canonical-equivalent character sequences.
</quote>

The view that two canonically-equivalent sequences of Unicode codepoints are interchangeable representations of the same textual data, and should behave as such, is a pretty fundamental aspect of the encoding standard.
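
To make the equivalence point concrete, here is a minimal Python sketch. It assumes NFC as a stand-in for the "fully-normalized form" in the charmod draft (which asks for more than plain NFC), and the name identity_match is made up for illustration. It shows two canonically-equivalent spellings of a Vietnamese letter that fail a naive comparison but match once both are normalized:

```python
import unicodedata

def identity_match(a: str, b: str) -> bool:
    """Compare two strings as if both had been normalized to NFC first.

    Hypothetical helper: NFC stands in for the charmod "fully-normalized"
    form, which this sketch does not attempt to implement in full.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW (precomposed)
precomposed = "\u1ec7"
# e + U+0323 COMBINING DOT BELOW + U+0302 COMBINING CIRCUMFLEX ACCENT
decomposed = "e\u0323\u0302"

naive_equal = precomposed == decomposed                    # code point comparison
normalized_equal = identity_match(precomposed, decomposed)  # comparison after NFC
```

Both strings render identically, yet only the normalized comparison treats them as the same text.
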
I don't think users will be well served by a situation where apparently-identical strings do not necessarily match.

> In what parts of CSS might you want unicode normalization to be
> done? The only case I can think of is selector matching that
> compares attribute values (or, in the future, text content). And
> even then it seems like it might be helpful in some cases and
> harmful in others.
>
> (I tend to think we probably don't want it for selectors, both since
> we currently have interoperability, and since selectors are
> particularly performance-sensitive.)

"Interoperability" based on non-normalization is an illusion that may work much of the time but will fail in apparently mysterious ways.

Suppose you have an HTML page and associated CSS file that has non-normalized selectors, and browsers don't normalize when matching. It works fine. Later someone edits the CSS with a text editor that automatically normalizes during file open/save (which it would be fully entitled to do, according to the Unicode standard).... you can't see what's broken, but suddenly the CSS and the HTML have different forms for the "same" selector, and it breaks.

Or suppose the HTML passes through a process -- perhaps a content filter of some kind -- that applies normalization, but passes the CSS through unchanged because it doesn't care about inspecting it.

I doubt performance would be a problem, as in practice selectors will virtually always be in NFC (most often simple ASCII), and you can verify this very cheaply at the same time as doing a "naive" string equality test; then you'd only have to fall back on a more expensive path in extremely rare cases. (Maybe it's time for a test implementation and some performance figures....)

JK
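
A rough sketch of the fast-path idea from the performance paragraph above, in Python. This is an illustration under stated assumptions, not a browser implementation: selector_strings_match is a hypothetical name, and the NFC check runs as a separate step here rather than being folded into the equality scan. It needs Python 3.8 or later for unicodedata.is_normalized.

```python
import unicodedata

def selector_strings_match(a: str, b: str) -> bool:
    """Hypothetical matcher: cheap checks first, full normalization last."""
    # Fast path 1: identical code point sequences always match.
    if a == b:
        return True
    # Fast path 2: ASCII-only text is always in NFC, so two unequal
    # ASCII strings cannot be canonically equivalent.
    if a.isascii() and b.isascii():
        return False
    # Fast path 3: NFC is unique per equivalence class, so two unequal
    # strings that are both already NFC cannot be equivalent either.
    if unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b):
        return False
    # Rare slow path: at least one side is not in NFC, so normalize both.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Only the last line pays for a full normalization, and it is reached only when at least one of the two strings is not already in NFC.
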
Received on Friday, 30 January 2009 00:42:22 UTC