- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 02 Feb 2009 14:38:23 -0500
- To: "Phillips, Addison" <addison@amazon.com>
- CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>, "www-style@w3.org" <www-style@w3.org>
Phillips, Addison wrote:
> The rules say that escapes can be used at any time. Any process that
> must deal with the document needs to remove escaping before processing
> dependent on the content. Or are you saying that these don't both
> match the same selector:
>
>   <p class="a">
>   <p class="a">

They do, sure.  On the other hand, if you tried to directly insert a
UTF-16 surrogate pair using such an escape, that would NOT work the same
as having the raw UTF-16 bytes in a UTF-16-encoded document.  I realize
that the difference is that escapes insert Unicode characters, and that
here we're dealing with situations where different sequences of Unicode
characters need to be treated identically.  In any case, you addressed
my question regarding this issue.

> Ah... well, that requires you to have fully preprocessed the strings
> -- removing escapes, cleaning whitespace, and (why not?) normalizing
> the strings if required.

Sure.  I suggested doing just this.  That was judged unacceptable, since
it would in fact change things like the localName available to the DOM,
no?

> NFC is safe to apply--certainly it is safe to apply internally. You
> can certainly normalize the internal representation of a document

Again, that's precisely what I suggested.  You said earlier in this
thread that parts of the internal representation cannot be normalized.
It's not clear where the line lies between the parts that can and can't
be, or whether there is such a line in the first place.

> Yes, but it is "not my fault" that normalization was ignored in the
> original implementation :-)

The original implementation implemented the specification as written.

>> But it's the check to see whether I have to normalize
>> that I'm worried about.
>
> If a document is NFC to start with, you only have to check the first
> character of any substring to ensure that it is also NFC

_Getting_ that first character is expensive.

> If the document representation is not NFC, it is safe to apply NFC to
> it for internal processing purposes

See above.

> When work on CharMod-Norm was started, the feeling was that "early
> uniform normalization" (EUN) was the answer. This is precisely what
> you (and others) are suggesting.

What I'm suggesting is that normalization in a browser happen at the
same time as the conversion from bytes-on-the-wire to Unicode
characters.

> but the reality is that you still have some.

Examples?

> And... nobody did EUN.

Sure.

> Note that a document in NFC doesn't guarantee that all operations on
> that document are in NFC.

I'm not sure what this is saying, to be honest.

> I don't believe that having the browser normalize portions of the
> document during the parse process for internal use

Which portions are OK to normalize?  This brings us back to David's
issue where you have to define normalization behavior for every single
part of every single specification if you don't specify that
normalization happens as data is converted into a stream of Unicode
characters...

I'm not saying the current situation is great (I think we all agree
it's terrible, though I happen to think that the root problem is in the
way Unicode was set up in the first place as regards normalization),
and I'm not saying we should ignore it.  I'm just looking for a
solution that can be implemented in a reasonable timeframe (months to a
small number of years, not decades) and that doesn't severely penalize
early adopters in the marketplace (e.g. by forcing unacceptable
performance penalties on them).

-Boris
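Purely as an illustration of the two ideas in play in this exchange -- normalizing once at the point where decoding happens, versus only having to look at the start of a string when everything is already NFC -- here is a minimal Python sketch, assuming the standard `unicodedata` module as a stand-in for a browser's Unicode library. The function names and the conservative boundary test are illustrative assumptions, not something proposed in the thread.

```python
import unicodedata


def decode_and_normalize(raw: bytes, encoding: str = "utf-8") -> str:
    """Apply NFC at the same point where bytes-on-the-wire become
    Unicode characters, so that every later consumer (parser, DOM,
    selector matching) only ever sees normalized text."""
    text = raw.decode(encoding)
    return unicodedata.normalize("NFC", text)


def concat_is_nfc(left: str, right: str) -> bool:
    """Assumes `left` and `right` are each already NFC.

    Per UAX #15, canonical composition only attaches a character to
    something *before* it, so if `right` starts with a character that
    can never be the second half of a composition (code points below
    U+0300 are a safe, conservative subset), the concatenation is
    still NFC without examining anything else.  Otherwise this sketch
    falls back to an explicit check; a real implementation would only
    re-examine a bounded window around the seam.
    """
    if not left or not right:
        return True
    if ord(right[0]) < 0x0300:  # conservative "stable boundary" test
        return True
    return unicodedata.is_normalized("NFC", left + right)  # Python 3.8+
```

The expensive part Boris objects to is exactly the non-fast path: deciding whether the boundary is safe requires materializing and inspecting the first character of the right-hand string, which is cheap in this toy sketch but not necessarily cheap inside a browser's internal string and DOM representations.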
Received on Monday, 2 February 2009 19:39:09 UTC