Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Tab Atkins Jr. on 2009-02-05 (www-style@w3.org from February 2009)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Thu, 5 Feb 2009 14:55:56 -0600
To: Robert J Burns <rob@robburns.com>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-ID: <dd0fbad0902051255y15275176sbbe3532cee3ee72e@mail.gmail.com>
On Thu, Feb 5, 2009 at 1:59 PM, Robert J Burns <rob@robburns.com> wrote:
> Hi Tab,
>
> On Feb 5, 2009, at 1:42 PM, Tab Atkins Jr. wrote:
>> It's been stated before that if we were allowed to do a similar eager
>> normalization to a particular normal form (NFC was the suggested form,
>> but the choice is irrelevant here), this really wouldn't pose any
>> problem.  The issue is that at least one person has stated that eager
>> normalization should not be done.
>
> Both Henri and Anne have argued against parser stage normalization for
> canonical equivalent character sequences.

True.  As far as I can gather (and please correct me, you two, if I
am misrepresenting you!), Henri is opposed to it due to the complexity
of guaranteeing that combining characters will be normalized in the
face of document.write(), etc., similar to the current issues faced in
normalizing CRLF.

Of course, this can be partially handled by simply specifying that UAs
MAY normalize, but authors must not depend on such normalization
happening.  This would allow browsers to still do the simple
normalization that is analogous to case normalization in Western
scripts, while avoiding the issues currently faced by CRLF
normalization.

Anne believes that early/eager/parser normalization may violate XML
1.0 (though this point was argued).  In addition, any normalization
effort that occurs will require coordination amongst multiple groups
before it becomes usable.  Thus, Anne believes that if *somebody* has
to expend effort to solve the normalization issue, that effort should
be spent earlier in the pipeline than the browser, as that requires
less coordination and less overall work.

Both, though, are *more* opposed to normalizing on the fly.

>> Having to handle case normalization on the fly in every string
>> comparison *would* be a horrific performance hit, which is why it's
>> done eagerly.  Thus this does not show any Western bias.
>
> The topic of discussion in this sub-thread is about parser normalization (I
> guess what you're calling eager normalization). I am in favor of it and
> Henri is against it. So this is about the same type of performance hit that
> case normalization takes at the parser level. Regardless my point about
> Western bias is that case sensitivity has been dealt with in all sorts of
> ways in nearly every spec. However, canonical normalization has not been
> dealt with in any satisfactory way and Henri continues to argue that it
> should not be dealt with in a satisfactory way (or how it has been dealt
> with should be deemed satisfactory by fiat). At the very least we need to
> normalize non-singletons (where the canonical decomposition of the character
> is not to only one character). Any combining characters need to be reordered
> into the order of their canonical combining class and precomposed characters
> need to be normalized (which could still leave the singleton decompositions
> that have other authoring problems untouched).
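
To make the operations Robert lists concrete, here is a minimal sketch
in Python using the standard unicodedata module.  The sample strings
are my own illustrations, not anything from the thread.  Note that
full NFC also rewrites singletons, so the last lines show what NFC
does rather than the narrower subset Robert describes.

```python
import unicodedata

# Two spellings of "q with dot above and dot below" that differ only in
# the order of the combining marks (illustrative strings).
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"

# Canonical ordering sorts marks by canonical combining class,
# lowest class first.
ccc_dot_below = unicodedata.combining("\u0323")  # COMBINING DOT BELOW
ccc_dot_above = unicodedata.combining("\u0307")  # COMBINING DOT ABOVE

nfc_a = unicodedata.normalize("NFC", dot_above_first)
nfc_b = unicodedata.normalize("NFC", dot_below_first)

# Non-singleton case: base letter + combining acute composes to the
# precomposed character U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")

# Singleton case: U+212B ANGSTROM SIGN decomposes to exactly one other
# character, U+00C5.  Full NFC replaces it as well.
singleton = unicodedata.normalize("NFC", "\u212b")
```

Before normalization the two "q" strings compare unequal; afterwards
they are the same sequence, which is the property a selector or ID
match would rely on.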

As Henri pointed out in an earlier email to a related thread, NFC
normalization (for example) is *not* directly analogous to case
normalization.  Case normalization happens to individual characters,
and in fact individual *code-points*.  It's an atomic process, within
the context of the parse stream, and can't be triggered or interrupted
through script action on the document.

Unicode normalization, on the other hand, is not atomic.
document.write() calls can inject combining characters mid-stream, or
can break up combining groups.  This can be very difficult to deal
with intelligently.  As
noted, this is analogous to the CRLF normalization that browsers
currently perform, which Henri says is quite a pain.  Regardless, CRLF
normalization is fairly necessary.  It affects nearly all authors, is
required by a vast corpus of legacy content, and is rooted in OS
behavior which is not likely to change.
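
The difference is easy to see in a rough sketch.  The Python below
stands in for a parser that receives text in pieces; the chunk split
and the helper names (normalize_chunks, lowercase_chunks) are
hypothetical, and str.lower() is only a stand-in for the ASCII case
folding a browser would do.

```python
import unicodedata

def normalize_chunks(chunks, form="NFC"):
    """Naively normalize each chunk on its own, then concatenate."""
    return "".join(unicodedata.normalize(form, c) for c in chunks)

def lowercase_chunks(chunks):
    """Lowercase each chunk on its own, then concatenate."""
    return "".join(c.lower() for c in chunks)

# Hypothetical stream: the base letter "e" arrives in one write and its
# combining acute accent (U+0301) arrives in the next.
chunks = ["Cafe", "\u0301 AU LAIT"]
whole = "".join(chunks)

# Case mapping here works per code point, so the chunk boundary is harmless.
case_matches = lowercase_chunks(chunks) == whole.lower()

# NFC needs to see the base letter and the mark together, so per-chunk
# normalization leaves "e" + U+0301 uncomposed.
nfc_matches = normalize_chunks(chunks) == unicodedata.normalize("NFC", whole)
```

Here case_matches comes out true and nfc_matches comes out false.  A
real parser would have to buffer across write boundaries (or
re-normalize around each insertion point) to get the right answer,
which is the same kind of bookkeeping CRLF handling already requires.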


What this boils down to is that late normalization is completely out
of the question, because it would produce *massive* performance
penalties and would require an immense amount of work (and certainly
generate an immense number of bugs), putting it on a time scale of
"decades".  Parser normalization is much better, but still comes with
baggage that makes it difficult, giving it a time scale of "months to
years".  The best normalization happens at the source, by requiring
authoring software to emit normalized data.  This has a time scale of
"immediate" if one has authoring tools that do this already for the
chosen language, and is no worse than parser normalization if no tools
currently exist.
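
As a rough idea of how little the source side needs, here is a sketch
of a save-time hook in Python.  The function names (is_nfc,
emit_normalized) and the sample selector are hypothetical; NFC and
UTF-8 are assumed only because NFC was the form suggested earlier.

```python
import unicodedata

def is_nfc(text):
    """Return True if the text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text

def emit_normalized(text):
    """Hypothetical save hook: normalize once, then encode for output."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

# A class selector typed with combining accents (illustrative input).
selector = ".re\u0301sume\u0301"

already_normalized = is_nfc(selector)
saved = emit_normalized(selector)
```

The whole string is available at save time, so none of the mid-stream
problems apply, and the same check could run as a lint step for
hand-written files.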

~TJ
Received on Thursday, 5 February 2009 20:56:32 UTC