RE: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Phillips, Addison on 2009-02-02 (public-i18n-core@w3.org from January to March 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 2 Feb 2009 09:53:23 -0800
To: Boris Zbarsky <bzbarsky@MIT.EDU>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA017DA5F40D@EX-SEA5-D.ant.amazon.com>

Boris Zbarsky wrote:

> > Absolutely not reasonable. Some scripts *require* the use of
> > combining marks.
>
> I understand them.  The question is how often such combining marks
> are inserted via character escapes.  Or possibly even how often
> they're inserted via character escapes when they have nothing to
> combine with...
>

No, not that either. There are many reasons why a combining mark (or any other character) might be presented as an escape. For example, the document's character encoding might not include the character.
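
To illustrate the encoding case, here is a minimal Python sketch. The helper name css_escape_unencodable is hypothetical and ISO-8859-1 is only an assumed document encoding; the escape form (backslash, hex digits, terminating space) follows the CSS 2.1 escape syntax. It is not a full CSS serializer and does not handle characters that are special in CSS.

```python
def css_escape_unencodable(text: str, encoding: str = "iso-8859-1") -> str:
    """Hypothetical helper: replace characters the target encoding
    cannot represent with CSS hex escapes (backslash + hex + space)."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)      # representable: keep the character as-is
            out.append(ch)
        except UnicodeEncodeError:   # not representable: emit an escape
            out.append(f"\\{ord(ch):x} ")
    return "".join(out)


# U+0323 COMBINING DOT BELOW has no ISO-8859-1 code point, so an author
# writing in that encoding has no choice but to escape it.
escaped = css_escape_unencodable("q\u0323")
```

The escape here has nothing to do with how common the character is; it is forced by the encoding of the document.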

The question of "statistical relevance" is, I think, a red herring.

Yes, Western European languages are permitted to use combining marks. However, virtually all of these languages use precomposed characters: their input systems, fonts, display systems, etc. are all skewed towards producing precomposed characters, and it takes special intervention to produce non-normalized content. If you look at the total of Web content today, you'll find that if you eliminate documents that use only the Latin-1 and Han (Chinese) ideograph sets of characters (note that I do not say "encodings"; these documents might very well use UTF-8), you have effectively eliminated 80% of all documents. This discussion isn't really about "majoritarian" needs and it would be wrong to represent it as such. Most of the languages that are plausibly affected are relatively obscure, at least today.

However, this is a problem of universal access. Many languages that rely on combining marks are minority languages that face other pressures (declining native literacy; majority language education; lack of vendor support). The speakers of these languages are expected to surmount many hurdles with keyboards, fonts, and so on. The idea that the pressure should be on these users to deal with these issues is exclusionary.

Other affected languages do not face this same level of challenge. Nonetheless, languages such as Vietnamese or Burmese do not and will never form more than a very small percentage of total Web content. Should we not address the Unicode-based requirements that such languages present just because most of the Internet is in English and/or Chinese?

On the question of performance, Anne's point about the comparison is incomplete. Yes, you only do a strcmp() in your code today. However, there are two problems with this observation.

First, any two strings that are equal are, well, equal. Normalizing them both won't change that. So an obvious performance boost is to call strcmp() first.

Second, the real performance test isn't merely the strcmp(). Selectors contain wildcards and other operations. And the comparisons are done on the document tree (there isn't just a single comparison). The overhead of normalization-checking the various comparable items is pretty small compared to the total execution time of the selection algorithm. Since (let's call it) 97% of the time you won't have to normalize anything, you can then proceed to do strcmp() with no further performance degradation.
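
A minimal sketch of that comparison order, in Python rather than C for brevity. The function name canonically_equal is made up for illustration, and the choice of NFC as the comparison form is an assumption; unicodedata.is_normalized requires Python 3.8 or later. This is not how any particular browser implements selector matching.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical comparison: cheap checks first, normalize only if needed."""
    # Fast path, the equivalent of strcmp(): identical strings are equal,
    # and normalizing them both would not change that.
    if a == b:
        return True
    # Normalization check: if both strings are already in NFC and they
    # differ, they cannot be canonically equivalent.
    if unicodedata.is_normalized("NFC", a) and unicodedata.is_normalized("NFC", b):
        return False
    # Slow path, taken only for the small share of non-normalized input.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Vietnamese "e with circumflex and dot below": precomposed U+1EC7 versus
# a base letter followed by two combining marks.
same = canonically_equal("\u1ec7", "e\u0302\u0323")
different = canonically_equal("abc", "abd")
```

For input that is already normalized, the only cost beyond the plain comparison is the normalization check, and that check runs only when the strings differ.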

So, I agree that normalization is a pain, that it is slower than strcmp(), and that it doesn't affect the (vast?) majority of users. But it *is* a real problem and it does affect users, specifically those whose languages rely on combining marks. Wishing that all of our text editors produced NFC is "nice", but not realistic.

And I do suggest that thread participants go back and take a hard look at CharMod-Norm. The Internationalization WG is changing direction on this document. Today the document represents precisely this desire for "early uniform normalization". But the WG has come to the conclusion that this is impossible to reconcile with the current state of software. Non-NFC documents are widespread and unlikely to go away. As a result, specs such as CSS3 Selectors and others must address normalization, positively or not, or affected users will be unable to figure out why "their language doesn't work with the Internet" or why visually indistinguishable strings that are canonically equivalent are not, for some reason, equal.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Monday, 2 February 2009 17:54:04 UTC