Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from L. David Baron on 2009-02-02 (public-i18n-core@w3.org from January to March 2009)

From: L. David Baron <dbaron@dbaron.org>
Date: Sun, 1 Feb 2009 18:57:55 -0800
To: Boris Zbarsky <bzbarsky@MIT.EDU>
Cc: public-i18n-core@w3.org, www-style@w3.org
Message-ID: <20090202025755.GA9377@pickering.dbaron.org>

On Sunday 2009-02-01 11:13 -0500, Boris Zbarsky wrote:
> One question I have is whether this issue would be resolved if a UA  
> performed parse-time normalization on everything (JS, CSS, XML, HTML).  
> That wouldn't completely help JS because you can build up strings  
> codepoint-by-codepoint but that also lets you create invalid UTF-16  
> strings, so I'm not sure it's worth worrying about right now.

I think parse-time normalization of everything, as Boris describes,
is the only reasonable solution here if we decide Unicode
normalization is important.

It solves the problem all at once, without having to worry about
changing the rules for selector matching, tons of distinct DOM APIs,
etc.  (It might cause a little pain for those using escaped
characters, e.g. numeric character references in HTML/XML or escaped
codepoints in CSS or JS, but that pain would largely be
transitional, and would arise only if they depend on matching
whichever normalization form is considered non-canonical.)
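
To make the escaped-character caveat concrete, here is a rough sketch.
It assumes the UA normalizes the source text to NFC before escapes are
resolved; NFC is my pick for illustration only, and unescape_css is a
hypothetical, heavily simplified helper, not a real CSS tokenizer.

```python
import re
import unicodedata


def unescape_css(ident: str) -> str:
    """Resolve simple CSS hex escapes (\\HEX with an optional trailing space).

    Hypothetical helper for illustration; real CSS escaping has more rules.
    """
    return re.sub(
        r"\\([0-9a-fA-F]{1,6}) ?",
        lambda m: chr(int(m.group(1), 16)),
        ident,
    )


def class_from_selector(source: str) -> str:
    """Normalize the selector source first, then resolve escapes."""
    normalized = unicodedata.normalize("NFC", source)
    # Drop the leading "." of a class selector, then unescape.
    return unescape_css(normalized[1:])


# "e" followed by U+0301 COMBINING ACUTE ACCENT, both written as escapes.
# The source is pure ASCII, so parse-time normalization leaves it alone.
escaped_selector = r".caf\65\301"

# The same class name as raw decomposed text in the document, which
# parse-time normalization turns into the precomposed form.
document_class = unicodedata.normalize("NFC", "cafe\u0301")

selector_class = class_from_selector(escaped_selector)
print(selector_class == document_class)
```

Under those assumptions the escaped selector still spells the
decomposed sequence after parsing, so it stops matching content that
was normalized on the way in.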

It also happens only once, probably right after character encoding
conversion: we receive bytes from the network, convert them into an
internal character representation (typically UTF-8 or UTF-16)
according to the encoding, and then parse that.
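
A minimal sketch of that pipeline, assuming NFC as the target form
(nothing above fixes which form would be chosen) and using a
hypothetical decode_and_normalize function in place of a UA's real
decoder:

```python
import unicodedata


def decode_and_normalize(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode network bytes, then normalize once before any parsing.

    The NFC choice is an assumption made for this example.
    """
    text = raw.decode(encoding)
    return unicodedata.normalize("NFC", text)


# Two byte streams for the same word: one with a combining accent,
# one with the precomposed character.
decomposed = "cafe\u0301".encode("utf-8")
precomposed = "caf\u00e9".encode("utf-8")

# After the single normalization step, every later stage (HTML, CSS,
# JS parsers, selector matching, DOM APIs) sees identical strings.
print(decode_and_normalize(decomposed) == decode_and_normalize(precomposed))
```

The parsers downstream would consume the returned string and would not
need to know anything about normalization.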

Sticking normalization into the process of selector matching would
solve only a small part of the problem, and would stick an extra
step into a process that already has fundamentally quadratic
computational complexity (selectors * elements).
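
As a rough illustration of the cost argument, the toy sketch below
counts normalization calls for a naive matcher that normalizes at
every comparison versus one that receives already-normalized input.
Both functions are hypothetical, selectors and elements are reduced to
plain strings, and a real engine could cache results, so read the
counts as an upper bound on the naive approach only.

```python
import unicodedata


def match_normalizing_each_time(selectors, elements):
    """Normalize both sides inside the selectors * elements loop."""
    calls = 0
    matches = []
    for sel in selectors:
        for el in elements:
            calls += 2  # one normalize call per side, per comparison
            if unicodedata.normalize("NFC", sel) == unicodedata.normalize("NFC", el):
                matches.append((sel, el))
    return matches, calls


def match_prenormalized(selectors, elements):
    """Normalize each input once (as at parse time), then compare."""
    sels = [(s, unicodedata.normalize("NFC", s)) for s in selectors]
    els = [(e, unicodedata.normalize("NFC", e)) for e in elements]
    calls = len(sels) + len(els)
    matches = [(s, e) for s, ns in sels for e, ne in els if ns == ne]
    return matches, calls


selectors = ["caf\u00e9", "nai\u0308ve", "plain"]
elements = ["cafe\u0301", "na\u00efve", "plain", "other"]

slow_matches, slow_calls = match_normalizing_each_time(selectors, elements)
fast_matches, fast_calls = match_prenormalized(selectors, elements)
print(slow_calls, fast_calls)
```

Both versions find the same matches; the difference is that the first
pays for normalization inside the quadratic loop, while the second
pays once per input string.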

-David

-- 
L. David Baron                                 http://dbaron.org/
Mozilla Corporation                       http://www.mozilla.com/

Received on Monday, 2 February 2009 02:58:46 UTC