RE: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 2 Feb 2009 11:21:45 -0800
To: Boris Zbarsky <bzbarsky@MIT.EDU>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA017DA5F5DB@EX-SEA5-D.ant.amazon.com>
Boris wrote:

> > The question of "statistical relevance" is, I think, a red herring.
> Not at all.  If there are, in practice, no reasons for someone to be doing something, then there is more leeway in terms of what handling can be allowed.  I'm not asking "how often" in terms of web pages, but "how often" in terms of web pages that use the characters in question at all.

The rules say that escapes can be used at any time. Any process that must deal with the document needs to remove escaping before doing any processing that depends on the content. Or are you saying that these don't both match the same selector:

<p class="a">
<p class="&#97;"><!-- decimal 97 == 0x61 == 'a' -->
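To illustrate the point about escapes, here is a minimal Python sketch. It uses the standard library's html.unescape as a stand-in for whatever unescaping step a real parser performs; the variable names are only for illustration and do not come from any browser's code.

```python
import html

# Two spellings of the same class attribute value, as in the markup above.
literal_value = "a"
escaped_value = "&#97;"  # decimal character reference for U+0061

# Resolve the character reference before any comparison is made.
resolved_value = html.unescape(escaped_value)

# After unescaping, both values are the same one-character string,
# so a selector such as p.a has to treat them identically.
values_match = (resolved_value == literal_value)
print(values_match)
```

The comparison only succeeds because the escape is resolved first; comparing the raw attribute text would not match.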

> It sounds like there are plenty good reasons for someone to be
> using escapes to insert combining marks.

Yes. The usual reason, though, is that the document's character encoding doesn't support the character. Browsers, for example, often substitute an escape for a character that can't be represented in the current document encoding. Using UTF-8, of course, avoids that. Other reasons may have to do with the storage and processing of the document.

> > On the question of performance, Anne's point about the comparison is incomplete. Yes, you only do a strcmp() in your code today.
> You apparently didn't understand my mail on performance.  We do NOT do an strcmp() today.  That would have an unacceptable performance cost. We (Gecko, in this case) intern all the relevant strings at parse-time and perform comparisons by comparing the interned string identifiers. This is a single equality comparison of a pair of native machine words (the pointers to the interned strings, to be precise).

Ah... well, that requires you to have fully preprocessed the strings---removing escapes, cleaning whitespace, and (why not?) normalizing the strings if required. NFC is safe to apply, at least internally. You can normalize the internal representation of a document, especially if the operation in question (selectors, in this case) requires you to do so.
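As a rough sketch of what "normalize, then intern" could look like, here is a Python version. It is not Gecko's implementation; intern_identifier is a hypothetical helper, and sys.intern merely stands in for a browser's own string-interning table.

```python
import sys
import unicodedata


def intern_identifier(raw: str) -> str:
    """Hypothetical helper: normalize to NFC, then intern the result.

    All of the normalization cost is paid once, at parse time.
    """
    return sys.intern(unicodedata.normalize("NFC", raw))


# The same identifier written two canonically equivalent ways.
precomposed = intern_identifier("caf\u00e9")   # U+00E9
decomposed = intern_identifier("cafe\u0301")   # 'e' + U+0301 combining acute

# Matching stays a single identity (pointer-style) comparison.
same_identifier = precomposed is decomposed
print(same_identifier)
```

Under this assumption the matching step itself is unchanged: it still compares two interned references, and only the interning step does extra work.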

> The fact is, the selection algorithm is highly optimized in most modern browsers (because it gets run so much), and normalization-checking might not be as cheap as you seem to think it is (for example, it requires either walking the entire string or flagging at internment time whether the string might require normalization, or something else).  There is nontrivial cost in either memory or performance or both compared to the comparisons that are done now.

Yes, but it is "not my fault" that normalization was ignored in the original implementation :-). The issue was known about. It just wasn't dealt with. Some amount of refactoring (assuming we do anything at all) will be required regardless of the solution chosen.

> > Since (let's call it) 97% of the time you won't have to normalize anything
> I fully expect that I don't have to normalize anything far more often than that.

I agree.

> But it's the check to see whether I have to normalize
> that I'm worried about.

If a document is NFC to start with, you only have to check the first character of any substring to ensure that it is also NFC. If the document representation is not NFC, it is safe to apply NFC to it for internal processing purposes, although you may wish to render the original (non-normalized) character sequences.
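A minimal Python sketch of that first-character check follows. It approximates "could combine with whatever precedes it" by testing for a non-zero canonical combining class. That is an assumption made for brevity: some characters with combining class 0 (certain Hangul jamo, for example) can also compose with a preceding character, so a real implementation would consult the Unicode normalization data instead. The function name is hypothetical.

```python
import unicodedata


def begins_with_combining_mark(text: str) -> bool:
    """Hypothetical helper: True if the first character has a non-zero
    canonical combining class (an approximation, see the note above)."""
    return bool(text) and unicodedata.combining(text[0]) != 0


# A fragment that starts with U+0301 looks harmless on its own ...
fragment = "\u0301tude"
needs_attention = begins_with_combining_mark(fragment)

# ... but it composes with a preceding base character once put in context.
recombined = unicodedata.normalize("NFC", "e" + fragment)

print(needs_attention, recombined == "\u00e9tude")
```

The check looks at one character, not the whole substring, which is why it is cheap when the document is already in NFC.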

> It basically sounds to me like there is a broken design on a lower level here and we're asking all sorts of other software and specifications to work around that breakage, to be honest...  That might well be needed, and wouldn't be the first time it's needed, but would the energy be more productively channeled into fixing the design?

The problem here is that Unicode normalization exists as an issue that developers (and hence Specifications) need to take into account when working with Unicode (the document character set of virtually all W3C formats). The solution, as embodied on this thread and elsewhere, has mostly been to try to declare it a non-issue---to duck and avoid it at all costs "because it doesn't affect anybody, really".

When work on CharMod-Norm was started, the feeling was that "early uniform normalization" (EUN) was the answer. This is precisely what you (and others) are suggesting. If all documents are in form NFC to begin with, we don't have as much of this normalization checking... but the reality is that you still have some.

And... nobody did EUN.

Note that a document in NFC doesn't guarantee that all operations on that document are in NFC. Selectors fit the description of a normalization-sensitive operation. They need to ensure that the substrings of the document that they are working on are themselves normalized (a relatively trivial check, if the document itself is "fully-normalized"). Or...

... or we all decide to permanently and irrevocably punt on the normalization issue. We give people information on the problem and hope that they will normalize their documents properly themselves.

I don't believe that having the browser normalize portions of the document during the parse process for internal use is that risky or bad. As you suggest, this probably is handled at a higher level than Selectors per se. However, it needs to be mentioned in Selectors because the behavior needs to be well-defined. As written today, it is perfectly valid for selectors to "not match" two canonically equivalent (from a Unicode normalization perspective) strings.
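For concreteness, here is a small Python sketch of two canonically equivalent strings that fail a plain comparison and match once both sides are normalized. canonical_match is a hypothetical name and is not taken from the Selectors specification.

```python
import unicodedata

# Two canonically equivalent spellings of the same class name.
precomposed = "r\u00e9sum\u00e9"      # uses U+00E9
decomposed = "re\u0301sume\u0301"     # uses 'e' + U+0301

# A code-point-for-code-point comparison treats them as different.
raw_match = (precomposed == decomposed)


def canonical_match(left: str, right: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))


normalized_match = canonical_match(precomposed, decomposed)
print(raw_match, normalized_match)
```

Without a stated rule, an implementation that behaves like raw_match and one that behaves like canonical_match are both conforming, which is the ambiguity described above.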


Received on Monday, 2 February 2009 19:23:10 UTC