Re: Unicode Normalization v. new specification of input method criteria from Robert J Burns on 2009-02-11 (www-style@w3.org from February 2009)

From: Robert J Burns <rob@robburns.com>
Date: Wed, 11 Feb 2009 03:55:03 -0600
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <88AAF5D3-6AED-42AD-B2C1-D2F8C583CA9F@robburns.com>
Hi Henri,

On Feb 11, 2009, at 1:43 AM, Henri Sivonen wrote:

> On Feb 10, 2009, at 19:00, Robert J Burns wrote:
>
>> Having the example back helps dramatically. However, you've taken  
>> the issue and boiled it down to the solved portion, ignoring what  
>> the thrust of the thread was about.
>
> What was the thrust of the i18n core comments then? Except for your  
> remarks, as far as I can tell, the thread has revolved around  
> keyboard input order or differences in input methods between  
> operating systems causing different code point sequences for same  
> visual appearances.

Except for my remarks? I think you should go back and re-read the  
thread. The concern has been over canonically equivalent identifier  
strings that have a potential to be falsely rejected as matches if  
implementations perform bytewise comparison rather than Unicode  
equivalent string comparison. Yes, it's clear your interventions  
reveal your thinking that if you can turn this into an input method  
issue, then nothing needs to be done about it (at least from the W3C  
perspective). I and others have pointed out problems with that line of  
thinking (including that 1. Unicode has nothing at all about  
requirements for input methods to produce normalized character  
sequences; and 2. even if Unicode required input methods produce  
normalized character sequences, Unicode still promotes two different  
normalized forms each of which will fail to match with bytewise string  
comparison).
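
As a minimal sketch of both points, using Python's standard unicodedata module (the identifier "café" and the helper name canonically_equal are made up for illustration, not taken from any spec):

```python
import unicodedata

# Hypothetical identifier "café" written two canonically equivalent ways.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC here)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Code point comparison and bytewise comparison both reject the match.
print(precomposed == decomposed)                                  # False
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))  # False

# Even if every producer normalized, one using NFC and another using NFD
# would still emit sequences that differ bytewise.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == nfd)                                                 # False

# Normalizing both sides to one form at comparison time accepts the match.
print(canonically_equal(precomposed, decomposed))                 # True
```

Which form the comparison normalizes to does not matter for the result, as long as both sides use the same one.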

> If that's a solved problem, great!

It may solve the problem you're trying to twist the discussion toward,  
but it doesn't come close to solving the problem on which this thread  
has focussed.

> I realize that the CSS WG doesn't work by the HTML Design  
> Principles, but since decision on the CSS side would leak to HTML  
> IDs and class names, I'm looking at this from the point of view of  
> the impact on HTML IDs and class names in the light of the HTML  
> Design Principles. The Support World Languages and Solve Real  
> Problems principles would apply.
>
> Unicode Normalization in the abstract is not a Real Problem. It  
> *may* be a *solution* to Real Problems.
>
> The i18n principle is Support World Languages--not Use All Unicode  
> Algorithms.

Obviously I wasn't suggesting we need to implement all Unicode  
algorithms. However I am suggesting that calling an implementation a  
Unicode implementation that treats Unicode equivalent strings as  
non-equivalent is messed up. The discussion however is over two strings  
that vary only by the order of grapheme extenders when the different  
order implies no semantic difference and also involves no visual  
distinction. Unicode prescribes a solution to this problem already:  
comparison of normalized character sequences and not bytewise  
comparison of character sequences. If you have a better solution to  
that problem than what Unicode already recommends, that's great. Share  
it with us.
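
To make the ordering case concrete, here is a rough sketch in Python (standard unicodedata module). The base letter and marks are arbitrary choices for illustration, and same_identifier is a hypothetical helper, not anything a spec defines:

```python
import unicodedata

# Same base letter and the same two combining marks, entered in either order.
dot_above_first = "q\u0307\u0323"  # U+0307 COMBINING DOT ABOVE, then U+0323 COMBINING DOT BELOW
dot_below_first = "q\u0323\u0307"  # the same marks in the opposite order

# The marks have different canonical combining classes, so Unicode treats
# their relative order as insignificant and defines one canonical order.
print(unicodedata.combining("\u0307"))  # 230 (above)
print(unicodedata.combining("\u0323"))  # 220 (below)


def same_identifier(a: str, b: str) -> bool:
    """Match identifiers on their canonically ordered (NFD) forms."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(dot_above_first == dot_below_first)                 # False
print(same_identifier(dot_above_first, dot_below_first))  # True
```

Canonical ordering sorts the class 220 mark ahead of the class 230 mark, so both inputs normalize to the same sequence.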

> Thus, we should see if there are world languages whose users  
> *actually* face problems with the way IDs and class names are  
> matched. If we find none, there's nothing to fix and we can stop  
> even if Unicode allows such cases to be constructed.

Richard Ishida, Jonathan Kew, and many others have all demonstrated  
cases where such problems would occur (even with Mac OS X keyboard  
input method). Your own example that the fl and fi ligatures are on  
the Mac OS X keyboard is a great example of a case where keyboard  
input methods, over 15 years after the introduction of Unicode,  
still do not support proper keyboard input (i.e., no one in an era of  
Unicode text and OpenType and AAT fonts should be entering ligatures  
into a document from anywhere, much less their keyboard).

> If there are some, then we should consider how those actual problems  
> are best addressed. (It's not a given that the best solution is  
> changing the consumer side.)
> In particular, we shouldn't assume a late-normalization hammer and  
> start constructing (perhaps using Character Palette on OS X) nails  
> that we can hit with it.

I can't imagine how you can say we're trying to tailor the problem to  
the normalization solution. The companion page I created[1] lists many  
different solutions to address the issue we've been discussing:  
including new norms to guide input method implementors. So I don't  
know how you can claim that I have been trying to turn this into an NFC  
normalization problem (I propose other normalization approaches there  
too). You yourself have insisted that NFC normalization from input  
methods is the only thing that makes sense, while I have suggested  
other normalization forms might be a more precise approach to address  
this issue (and lead to better performing implementations that address  
the issue). However with 15 years of Unicode 1.1 to 5.1 around with no  
such requirements directed at Unicode input methods, it's hard to  
imagine how to handle this issue without some consumer-level approach.  
If you have some genuine ideas that aren't just repeating how Mac OS  
X's keyboard input solves everything, I'd be happy to listen. But  
simply saying it's all the fault and responsibility of input method  
implementors just makes no sense to me.

On Feb 11, 2009, at 3:19 AM, Anne van Kesteren wrote:
> On Wed, 11 Feb 2009 09:12:55 +0100, Ambrose Li <ambrose.li@gmail.com>
> wrote:
>>  Pardon my ignorance too, but this is complete news to me. As far  
>> as I
>>  can tell the discussion was not "revolved around" input methods at
>>  all. IME was part of the discussion, but in no way was the focus.
>
> As far as I can tell Henri is right. The reason the i18n WG wants this
> solved on the user agent side is because the authoring side is
> inconsistent in choosing a particular Unicode Normalization Form.

If that's all Henri is saying, I don't think we have much  
disagreement. Yes, after more than 15 years of Unicode providing no  
clear norms for input methods and content producing tools to normalize  
character sequences in a unified manner, it is necessary to do so on  
the consumer side. I wasn't sure Henri understood that.

However, the issue goes further than that even, since Unicode (and XML  
and CSS) simultaneously supports transcoding from any other character  
set encoding. That means that as a parser converts other character  
streams into UCS code points those too will be in a different, likely  
non-canonical, order. In fact the web needs to support situations  
where a CSS document is in an ISO encoding and an HTML document is in  
a UTF encoding and this introduces another place for bytewise  
comparisons to break. There's no way I can imagine to control all of  
this at the input method layer. So if that is the case, we need to  
look at consumer-side normalization (of some form) to address the  
issue (unless there's some other solution I'm forgetting).
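
A small sketch of the mixed-encoding case, again in Python. The class name, the byte strings, and the choice of ISO-8859-1 and UTF-8 are assumptions picked for illustration; real documents and parsers will differ:

```python
import unicodedata

# Hypothetical class name "café" as it might arrive from two documents.
css_bytes = b"caf\xe9"        # stylesheet served as ISO-8859-1 (0xE9 is e-acute)
html_bytes = b"cafe\xcc\x81"  # HTML served as UTF-8, written as "e" + U+0301

# Transcoding each stream to code points does not reconcile them: the
# ISO-8859-1 byte maps to precomposed U+00E9, while the UTF-8 stream
# keeps the decomposed pair it was authored with.
css_name = css_bytes.decode("iso-8859-1")
html_name = html_bytes.decode("utf-8")

print(css_bytes == html_bytes)  # False
print(css_name == html_name)    # False

# Normalizing on the consumer side, after decoding, makes the selector match.
match = (unicodedata.normalize("NFC", css_name)
         == unicodedata.normalize("NFC", html_name))
print(match)                    # True
```

No input method is involved on the stylesheet side here; the mismatch comes from the decoding step.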

Take care,
Rob

[1]: <http://esw.w3.org/topic/I18N/CanonicalNormalizationIssues?action=show#head-f95f8528a2cf87256a33bd3042b3ae595a0bf5e5>
Received on Wednesday, 11 February 2009 09:55:45 UTC