- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 02 Feb 2009 14:38:23 -0500
- To: "Phillips, Addison" <addison@amazon.com>
- CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>
Phillips, Addison wrote:
> The rules say that escapes can be used at any time. Any process that must deal with the document needs to remove escaping before processing dependent on the content. Or are you saying that these don't both match the same selector:
>
>   <p class="a">
>   <p class="a">

They do, sure. On the other hand, if you tried to directly insert a UTF-16 surrogate pair using such an escape, that would NOT work the same as having the raw UTF-16 bytes in a UTF-16-encoded document. I realize that the difference is that escapes insert Unicode characters, and that here we're dealing with situations where different sequences of Unicode characters need to be treated identically. In any case, you addressed my question regarding this issue.

> Ah... well, that requires you to have fully preprocessed the strings--removing escapes, cleaning whitespace, and (why not?) normalizing the strings if required.

Sure. I suggested doing just this. That was judged unacceptable, since it would in fact change things like the localName available to the DOM, no?

> NFC is safe to apply--certainly it is safe to apply internally. You can certainly normalize the internal representation of a document

Again, that's precisely what I suggested. You said earlier in this thread that parts of the internal representation cannot be normalized. It's not clear to me where the line lies between the parts that can and can't, or whether there is such a line in the first place.

> Yes, but it is "not my fault" that normalization was ignored in the original implementation :-)

The original implementation implemented the specification as written.

>> But it's the check to see whether I have to normalize
>> that I'm worried about.
>
> If a document is NFC to start with, you only have to check the first character of any substring to ensure that it is also NFC

_Getting_ that first character is expensive.

> If the document representation is not NFC, it is safe to apply NFC to it for internal processing purposes

See above.

> When work on CharMod-Norm was started, the feeling was that "early uniform normalization" (EUN) was the answer. This is precisely what you (and others) are suggesting.

What I'm suggesting is that normalization in a browser happen at the same time as conversion from bytes-on-the-wire to Unicode characters happens.

> but the reality is that you still have some.

Examples?

> And... nobody did EUN.

Sure.

> Note that a document in NFC doesn't guarantee that all operations on that document are in NFC.

I'm not sure what this is saying, to be honest.

> I don't believe that having the browser normalize portions of the document during the parse process for internal use

Which portions are OK to normalize? This brings us back to David's issue where you have to define normalization behavior for every single part of every single specification if you don't specify that normalization happens as data is converted into a stream of Unicode characters...

I'm not saying the current situation is great (I think we all agree it's terrible, though I happen to think that the root problem is in the way Unicode was set up in the first place as regards normalization), and I'm not saying we should ignore it. I'm just looking for a solution that can be implemented in reasonable timeframes (months to a small number of years, not decades) and that doesn't severely penalize early adopters in the marketplace (e.g. by forcing unacceptable performance penalties on them).

-Boris
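A minimal sketch of the first-character quick check discussed above, assuming the document the substring was taken from is already NFC; it uses Python's unicodedata, and the function name and the choice to flag any leading non-starter are illustrative assumptions rather than anything the specifications define:

    import unicodedata

    def first_char_needs_full_check(substring: str) -> bool:
        # Assumption: the document this substring came from is already NFC.
        # A contiguous substring can then only be suspect at its start, so
        # inspect the first character alone: a non-starter (nonzero canonical
        # combining class) means a full normalization check is still needed.
        if not substring:
            return False
        return unicodedata.combining(substring[0]) != 0

    # first_char_needs_full_check("\u0301abc")  -> True  (leading combining acute)
    # first_char_needs_full_check("abc")        -> False

And a sketch of normalizing at the same time as bytes-on-the-wire are converted to Unicode characters, again in Python; a real browser would work incrementally over the network stream, where chunk boundaries complicate NFC, so this assumes the whole payload is in hand:

    import unicodedata

    def decode_and_normalize(raw: bytes, encoding: str = "utf-8") -> str:
        # Convert bytes to Unicode characters and apply NFC in the same step,
        # before any parser or the DOM ever sees the text.
        return unicodedata.normalize("NFC", raw.decode(encoding))

    # decode_and_normalize(b"e\xcc\x81") == "\u00e9"  (e + combining acute composes)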
Received on Monday, 2 February 2009 19:39:08 UTC