
Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 15:37:31 -0600
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <D5649E7F-8D8C-4D19-9B21-916DCFFAACDE@robburns.com>
To: "Tab Atkins Jr." <jackalmage@gmail.com>

Hi Tab,

On Feb 5, 2009, at 2:55 PM, Tab Atkins Jr. wrote:

> On Thu, Feb 5, 2009 at 1:59 PM, Robert J Burns <rob@robburns.com>  
> wrote:
>> Hi Tab,
>>
>> On Feb 5, 2009, at 1:42 PM, Tab Atkins Jr. wrote:
>>> It's been stated before that if we were allowed to do a similar  
>>> eager
>>> normalization to a particular normal form (NFC was the suggested  
>>> form,
>>> but the choice is irrelevant here), this really wouldn't pose any
>>> problem.  The issue is that at least one person has stated that  
>>> eager
>>> normalization should not be done.
>>
>> Both Henri and Anne have argued against parser stage normalization  
>> for
>> canonical equivalent character sequences.
>
> True.  As well as I can gather (and please correct me, you two, if I
> am misrepresenting you!), Henri is opposed to it due to the complexity
> of guaranteeing that combining characters will be normalized in the
> face of document.write(), etc., similar to the current issues faced in
> normalizing CRLF.

This can be dealt with by requiring parsers to normalize (including in
the case of document.write, though without needing to normalize the
combined results of document.write), while simultaneously telling
authors that they should not use document.write and that they must
never use a document.write that begins without a base character.

> Of course, this can be partially handled by simply specifying that UAs
> MAY normalize, but authors must not depend on such normalization
> happening.

This, I think, is too weak a criterion. Better to require parser
normalization and to prohibit any use of document.write that starts
with non-base characters (at the very least, the document.write would
need to start with a space, U+0020).

> This would allow browsers to still do the simple
> normalization that is analogous to case normalization in western
> scripts, while avoiding the issues currently faced by CRLF
> normalization.

My scenario also avoids that problem without laying other persistent  
problems at the feet of authors.

> Anne believes that early/eager/parser normalization may violate XML
> 1.0 (though this point was argued).

But Anne has provided no explanation for where that interpretation  
comes from.

> In addition, any normalization
> effort that occurs will require coordination amongst multiple groups
> before it becomes usable.  Thus, Anne believes that if *somebody* has
> to expend effort to solve the normalization issue, it should be
> earlier than the browser, as that requires less coordination and less
> overall work.

But Anne has provided no explanation of how that coordination might  
work. Which authority will coordinate that approach? What  
specifications will be written to direct implementations in that  
approach? I'd try to help out here and offer suggestions, but I have   
no idea what Anne is thinking on this front.

> Both, though, are *more* opposed to normalizing on the fly.
>
>>> Having to handle case normalization on the fly in every string
>>> comparison *would* be a horrific performance hit, which is why it's
>>> done eagerly.  Thus this does not show any Western bias.
>>
>> The topic of discussion in this sub-thread is about parser  
>> normalization (I
>> guess what you're calling eager normalization). I am in favor of it  
>> and
>> Henri is against it. So this is about the same type of performance  
>> hit that
>> case normalization takes at the parser level. Regardless my point  
>> about
>> Western bias is that case sensitivity has been dealt with in all  
>> sorts of
>> ways in nearly every spec. However, canonical normalization has not  
>> been
>> dealt with in any satisfactory way and Henri continues to argue  
>> that it
>> should not be dealt with in a satisfactory way (or how it has been  
>> dealt
>> with should be deemed satisfactory by fiat). At the very least we  
>> need to
>> normalize non-singletons (where the canonical decomposition of the  
>> character
>> is not to only one character). Any combining characters need to be  
>> reordered
>> into the order of their canonical combining class and precomposed  
>> characters
>> need to be normalized (which could still leave the singleton  
>> decompositions
>> that have other authoring problems untouched).
>
> As Henri pointed out in an earlier email to a related thread, NFC
> (frex) normalization is *not* directly analogous to case
> normalization.  Case normalization happens to individual characters,
> and in fact individual *code-points*.  It's an atomic process, within
> the context of the parse stream, and can't be triggered or interrupted
> through script action on the document.

The prohibition on starting text strings with non-base characters
would be sufficient here. I can't think of any use case that requires
strings beginning with non-base characters (for document.write,
innerHTML, outerHTML, etc.).

> Unicode normalization, on the other hand, is not.  document.write()s
> can inject combining characters mid-stream, or can break up combining
> groups.  This can be very difficult to deal with intelligently.

However, with a prohibition on such strings starting with non-base
characters, implementations don't have to concern themselves with
such problems.
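The boundary problem Tab describes can be shown concretely. This is a Python sketch using the standard unicodedata module; the two chunks are invented stand-ins for what a parser might receive from successive document.write() calls:

```python
import unicodedata

# Two chunks as a parser might receive them; the second begins with
# U+0301 COMBINING ACUTE ACCENT rather than a base character.
first = "cafe"
second = "\u0301 au lait"

# Normalizing each chunk eagerly and concatenating misses the
# composition across the chunk boundary...
piecewise = (unicodedata.normalize("NFC", first)
             + unicodedata.normalize("NFC", second))

# ...whereas normalizing the whole stream composes e + U+0301 into
# the single precomposed character U+00E9.
whole = unicodedata.normalize("NFC", first + second)

assert piecewise != whole
assert whole.startswith("caf\u00e9")
```

Requiring every chunk to begin with a base character is exactly what makes per-chunk normalization safe: no combining character can then reach back across a chunk boundary.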

> As
> noted, this is analogous to the CRLF normalization that browsers
> currently perform, which Henri says is quite a pain.  Regardless, CRLF
> normalization is fairly necessary.  It affects nearly all authors, is
> required by a vast corpus of legacy content, and is rooted in OS
> behavior which is not likely to change.

Agreed, but for many non-Western languages canonical normalization is
just as pervasive a problem. And this normalization doesn't require
the parser to back up as it would for CRLF normalization, so it is
much simpler.

> What this boils down to is that late normalization is completely out
> of the question, because it would produce *massive* performance
> penalties and would require an immense amount of work (and certainly
> generate an immense number of bugs), putting it on a time scale of
> "decades".

Agreed. I have heard very few (if any) here advocating for late  
normalization (post-parsing). I'm not sure why it keeps getting raised  
as a point of contention.

> Parser normalization is much better, but still comes with
> baggage that makes it difficult, giving it a time scale of "months to
> years".

No, I don't think it is any more difficult than earlier approaches.  
Much less difficult than teaching authors about the intricacies of  
character to code point mapping, canonical ordering of combining  
characters, etc.
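Those intricacies are easy to illustrate. In this Python sketch (standard unicodedata module; the variable names are mine), the same accented letter is keyed in two different orders, and NFC's canonical reordering and composition make them identical:

```python
import unicodedata

# The same visual result typed in two different key orders:
# 'a' + U+0301 ACUTE (class 230) + U+0316 GRAVE BELOW (class 220),
# and the reverse order of the two combining marks.
typed_one_way = "a\u0301\u0316"
typed_other_way = "a\u0316\u0301"

# As raw code-point sequences they differ...
assert typed_one_way != typed_other_way

# ...but NFC reorders the marks by canonical combining class and
# composes a + U+0301 into U+00E1, yielding the identical sequence.
assert (unicodedata.normalize("NFC", typed_one_way)
        == unicodedata.normalize("NFC", typed_other_way))
```

Expecting authors to keep canonical combining classes straight by hand, when the parser can apply this mechanically, is exactly the burden I want to avoid.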

> The best normalization happens at the source, by requiring
> authoring software to emit normalized data.

But again, that doesn't solve the problem, because one authoring tool
may output strings in one normalization form and another tool in a
different form, and a later consumer of this content must then compare
strings from the two different tools.
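Concretely (a Python sketch using the standard unicodedata module; the two "tools" are hypothetical):

```python
import unicodedata

# The same word as emitted by two hypothetical authoring tools:
from_tool_a = "caf\u00e9"    # precomposed U+00E9 (NFC form)
from_tool_b = "cafe\u0301"   # 'e' + U+0301 COMBINING ACUTE (NFD form)

# A consumer doing a raw code-point comparison sees two different
# strings...
assert from_tool_a != from_tool_b

# ...even though they are canonically equivalent; the consumer still
# has to normalize both sides itself before comparing.
assert (unicodedata.normalize("NFC", from_tool_a)
        == unicodedata.normalize("NFC", from_tool_b))
```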

> This has a timescale of
> "immediate" if one has authoring tools that do this already for the
> chosen language, and is no worse than parser normalization if no tools
> currently exist.

The problem is that there is no authority providing advice to
authoring tools that solves the normalization problem. The only advice
we have is norms that would allow this problem to be solved in the
parser (the earlier-cited C6 from chapter 3 of the Unicode Standard).
I don't see how this norm could be stretched into a solution in which
authoring tools solve the problem. So this solution isn't immediate;
it needs to wait for Unicode 5.2 or some other standard to address
the issue.

On the other hand, we have Unicode 5.1 (and going back to Unicode 1.1
or 1.0), where implementations are told that they should ideally treat
canonically equivalent strings as matching strings. This norm is
already there and can be applied to parsers, fonts, input systems,
etc. It could be applied in the immediate next release of any of these
products. In contrast, we should not be expecting (or even asking)
authors to understand these issues.

Take care,
Rob
Received on Thursday, 5 February 2009 21:38:14 GMT