Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Boris Zbarsky on 2009-02-02 (www-style@w3.org from February 2009)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 02 Feb 2009 13:23:15 -0500
To: "Phillips, Addison" <addison@amazon.com>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>
Message-ID: <49873A13.8000905@mit.edu>

Phillips, Addison wrote:
> The question of "statistical relevance" is, I think, a red herring.

Not at all.  If there are, in practice, no reasons for someone to be 
doing something, then there is more leeway in terms of what handling can 
be allowed.  I'm not asking "how often" in terms of web pages, but "how 
often" in terms of web pages that use the characters in question at all.

It sounds like there are plenty of good reasons for someone to be using 
escapes to insert combining marks.

> Yes, Western European languages are permitted to use combining marks.
[etc]

I don't see what this has to do with the question I actually asked, for 
what it's worth.

> However, this is a problem of universal access. Many languages that rely on combining marks are minority languages that face other pressures (declining native literacy; majority language education; lack of vendor support). The speakers of these languages are expected to surmount many hurdles---with keyboards, fonts, etc. etc. The idea that the pressure should be on these users to deal with these issues is exclusionary. 

You seem to be addressing an argument that someone else, not I, made...

> On the question of performance, Anne's point about the comparison is incomplete. Yes, you only do a strcmp() in your code today.

You apparently didn't understand my mail on performance.  We do NOT do 
an strcmp() today.  That would have an unacceptable performance cost. 
We (Gecko, in this case) intern all the relevant strings at parse-time 
and perform comparisons by comparing the interned string identifiers. 
This is a single equality comparison of a pair of native machine words 
(the pointers to the interned strings, to be precise).
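
To illustrate the interning approach described above, here is a minimal Python sketch. It is not Gecko's code: the `AtomTable` class and the small integer ids standing in for interned-string pointers are assumptions made for the example.

```python
class AtomTable:
    """Hypothetical intern table: maps each distinct string to one small integer id."""

    def __init__(self):
        self._ids = {}

    def intern(self, text):
        # The hash lookup (and any string walk) is paid once, at parse time.
        return self._ids.setdefault(text, len(self._ids))


atoms = AtomTable()

# Parse time: the selector's class name and the document's class names are interned.
selector_class = atoms.intern("warning")
element_classes = [atoms.intern(name) for name in ("note", "warning", "note")]

# Match time: each comparison is a single integer equality test,
# with no character-by-character walk of either string.
match_flags = [cls == selector_class for cls in element_classes]
```

The point of the design is that the per-comparison cost no longer depends on string length, which is why adding any per-comparison string inspection shows up as a relative slowdown.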

> First, any two strings that are equal are, well, equal. Normalizing them both won't change that. So an obvious performance boost is to call strcmp() first.

That doesn't help, because in the common case selectors in fact do not 
match.  So detecting matching quicker is actually not much use.  What's 
needed is detecting that the selector doesn't match as quickly as possible.

> But the real performance test isn't merely the strcmp(). Selectors contains wildcards and other operations.

Very rarely.  The vast majority of selectors contain at least one direct 
string comparison operation (id match, tag name match, class name 
match), and these are performed first.  If they don't match (common 
case) then nothing needs to be done for the more expensive parts of the 
selector.
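
The ordering argument above can be sketched roughly as follows. The dictionary layout, the `slow_test` callback and the `stats` counters are hypothetical and exist only to show how often the expensive part is reached; a real engine would compare interned identifiers rather than Python strings.

```python
def selector_matches(element, selector, stats):
    # Cheap part first: direct name comparisons (tag, then class).
    stats["cheap"] += 1
    if selector["tag"] != element["tag"]:
        return False  # common case: rejected after one comparison
    if selector["class"] not in element["classes"]:
        return False
    # Only elements that survive the cheap checks pay for the expensive part.
    stats["expensive"] += 1
    return selector["slow_test"](element)


# Hypothetical selector: tag "li", class "active", plus some costlier structural test.
selector = {"tag": "li", "class": "active", "slow_test": lambda el: el["index"] % 2 == 1}
elements = [
    {"tag": "div", "classes": set(), "index": 1},
    {"tag": "p", "classes": {"active"}, "index": 2},
    {"tag": "li", "classes": {"active"}, "index": 3},
    {"tag": "span", "classes": set(), "index": 4},
]

stats = {"cheap": 0, "expensive": 0}
results = [selector_matches(el, selector, stats) for el in elements]
```

In this toy document the expensive test runs for one element out of four; the other three are rejected by a cheap comparison, which is the case that needs to stay fast.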

> And the comparisons are done on the document tree (there isn't just a single comparison).

Indeed.  If there were just a single comparison no one would be worried 
about its performance!

> The overhead of normalization-checking the various comparable items is pretty small compared to the total execution time of the selection algorithm.

Do you actually have any data to back this up?  The fact is, the 
selection algorithm is highly optimized in most modern browsers (because 
it gets run so much), and normalization-checking might not be as cheap 
as you seem to think it is (for example, it requires either walking the 
entire string or flagging at interning time whether the string might 
require normalization, or something else).  There is nontrivial cost in 
either memory or performance or both compared to the comparisons that 
are done now.
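
One way to picture the "flag at interning time" option and its costs is the sketch below. It is an assumption-laden illustration, not a proposal from this thread: `FlaggedAtomTable` and `might_need_normalization` are hypothetical names, NFC is assumed as the target form, and the flag relies on the conservative rule that a string whose code points are all below U+0300 is already in NFC.

```python
import unicodedata


def might_need_normalization(text):
    # Conservative flag: walks the whole string once, at interning time.
    return any(ord(ch) >= 0x0300 for ch in text)


class FlaggedAtomTable:
    """Hypothetical intern table that stores one extra flag per string."""

    def __init__(self):
        self._atoms = {}  # text -> (id, flag)
        self._texts = []  # id -> text, needed for the slow path

    def intern(self, text):
        atom = self._atoms.get(text)
        if atom is None:
            atom = (len(self._texts), might_need_normalization(text))
            self._atoms[text] = atom
            self._texts.append(text)
        return atom

    def equal(self, a, b):
        if a[0] == b[0]:
            return True  # same interned string
        if not (a[1] or b[1]):
            return False  # both already NFC and distinct: cannot be equivalent
        # Slow path: at least one side is flagged, so both must be normalized.
        left = unicodedata.normalize("NFC", self._texts[a[0]])
        right = unicodedata.normalize("NFC", self._texts[b[0]])
        return left == right


table = FlaggedAtomTable()
precomposed = table.intern("caf\u00e9")   # e-acute as one code point
decomposed = table.intern("cafe\u0301")   # "e" followed by a combining acute accent
plain = table.intern("cafe")
```

Even in this form, every failed comparison now inspects two flags in addition to the id test, and every interned string carries extra state, which is the memory-or-time trade-off mentioned above.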

> Since (let's call it) 97% of the time you won't have to normalize anything

I fully expect that I don't have to normalize anything far more often 
than that.  But it's the check to see whether I have to normalize that 
I'm worried about.

It basically sounds to me like there is a broken design on a lower level 
here and we're asking all sorts of other software and specifications to 
work around that breakage, to be honest...  That might well be needed, 
and wouldn't be the first time it's needed, but would the energy be more 
productively channeled into fixing the design?

Put another way, if we're looking at a multi-year deployment timeframe 
for Selectors implementations that perform normalization, then is 
Selectors the right place to be doing normalization?  Or would it be 
better to spend the time putting in normalization on a lower level?  You 
say that this is not compatible with the current state of software; are 
there any estimates of what it would take to shift that state the way 
you're trying to shift the state of browsers?

-Boris
Received on Monday, 2 February 2009 18:23:59 UTC