- From: Tab Atkins Jr. <jackalmage@gmail.com>
- Date: Thu, 5 Feb 2009 14:55:56 -0600
- To: Robert J Burns <rob@robburns.com>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
On Thu, Feb 5, 2009 at 1:59 PM, Robert J Burns <rob@robburns.com> wrote:

> Hi Tab,
>
> On Feb 5, 2009, at 1:42 PM, Tab Atkins Jr. wrote:
>
>> It's been stated before that if we were allowed to do a similar eager normalization to a particular normal form (NFC was the suggested form, but the choice is irrelevant here), this really wouldn't pose any problem. The issue is that at least one person has stated that eager normalization should not be done.
>
> Both Henri and Anne have argued against parser-stage normalization for canonically equivalent character sequences.

True. As well as I can gather (and please correct me, you two, if I am misrepresenting you!), Henri is opposed to it because of the complexity of guaranteeing that combining characters will be normalized in the face of document.write(), etc., similar to the issues currently faced in normalizing CRLF. Of course, this can be partially handled by simply specifying that UAs MAY normalize, but that authors must not depend on such normalization happening. That would allow browsers to still do the simple normalization that is analogous to case normalization in Western scripts, while avoiding the issues currently faced by CRLF normalization.

Anne believes that early/eager/parser normalization may violate XML 1.0 (though this point was argued). In addition, any normalization effort will require coordination among multiple groups before it becomes usable. Thus, Anne believes that if *somebody* has to expend effort to solve the normalization issue, it should happen earlier in the pipeline than the browser, as that requires less coordination and less overall work.

Both, though, are *more* opposed to normalizing on the fly.

>> Having to handle case normalization on the fly in every string comparison *would* be a horrific performance hit, which is why it's done eagerly. Thus this does not show any Western bias.
>
> The topic of discussion in this sub-thread is parser normalization (I guess what you're calling eager normalization). I am in favor of it and Henri is against it. So this is about the same type of performance hit that case normalization takes at the parser level. Regardless, my point about Western bias is that case sensitivity has been dealt with in all sorts of ways in nearly every spec. However, canonical normalization has not been dealt with in any satisfactory way, and Henri continues to argue that it should not be dealt with in a satisfactory way (or that how it has been dealt with should be deemed satisfactory by fiat). At the very least we need to normalize non-singletons (where the canonical decomposition of the character is not to only one character). Any combining characters need to be reordered into the order of their canonical combining classes, and precomposed characters need to be normalized (which could still leave the singleton decompositions, which have other authoring problems, untouched).

As Henri pointed out in an earlier email to a related thread, NFC (frex) normalization is *not* directly analogous to case normalization. Case normalization happens to individual characters, in fact to individual *code points*. It is an atomic process within the context of the parse stream, and it can't be triggered or interrupted through script action on the document.

Unicode normalization, on the other hand, is not atomic. document.write() calls can inject combining characters mid-stream, or can break up combining sequences. This can be very difficult to deal with intelligently.
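To make that concrete, here is a minimal sketch (it relies on the ES2015 String.prototype.normalize() method, which is a later addition than this thread, and the strings and chunk boundary are invented purely for illustration). It shows that canonically equivalent sequences differ as raw code-point sequences until they are normalized, and that normalizing each document.write() chunk on its own is not the same as normalizing the assembled stream once a combining character arrives in a later chunk.

    // Canonically equivalent, but distinct as raw code point sequences:
    const decomposed  = "e\u0301";  // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    const precomposed = "\u00E9";   // LATIN SMALL LETTER E WITH ACUTE

    console.log(decomposed === precomposed);                                   // false
    console.log(decomposed.normalize("NFC") === precomposed.normalize("NFC")); // true

    // A combining character arriving in a later chunk defeats naive
    // per-chunk normalization:
    const chunk1 = "cafe";     // as if from document.write("cafe")
    const chunk2 = "\u0301!";  // as if from document.write("\u0301!") -- a lone combining mark

    const perChunk    = chunk1.normalize("NFC") + chunk2.normalize("NFC");
    const wholeStream = (chunk1 + chunk2).normalize("NFC");

    console.log(perChunk === wholeStream);  // false: "cafe\u0301!" vs "caf\u00E9!"

An eagerly normalizing parser therefore has to carry state across chunk boundaries (or re-examine text near them), which is exactly the kind of bookkeeping that makes parser-level normalization harder than per-code-point case folding.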
As noted, this is analogous to the CRLF normalization that browsers currently perform, which Henri says is quite a pain. Regardless, CRLF normalization is fairly necessary: it affects nearly all authors, is required by a vast corpus of legacy content, and is rooted in OS behavior that is not likely to change.

What this boils down to is that late normalization is completely out of the question, because it would produce *massive* performance penalties and would require an immense amount of work (and would certainly generate an immense number of bugs), putting it on a time scale of "decades". Parser normalization is much better, but still comes with baggage that makes it difficult, giving it a time scale of "months to years". The best normalization happens at the source, by requiring authoring software to emit normalized data. This has a time scale of "immediate" if authoring tools that do this already exist for the chosen language, and is no worse than parser normalization if no such tools currently exist.

~TJ
Received on Thursday, 5 February 2009 20:56:32 UTC