- From: Robert J Burns <rob@robburns.com>
- Date: Wed, 11 Feb 2009 03:55:03 -0600
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Hi Henri,

On Feb 11, 2009, at 1:43 AM, Henri Sivonen wrote:

> On Feb 10, 2009, at 19:00, Robert J Burns wrote:
>
>> Having the example back helps dramatically. However, you've taken
>> the issue and boiled it down to the solved portion, ignoring what
>> the thrust of the thread was about.
>
> What was the thrust of the i18n core comments then? Except for your
> remarks, as far as I can tell, the thread has revolved around
> keyboard input order or differences in input methods between
> operating systems causing different code point sequences for the
> same visual appearances.

Except for my remarks? I think you should go back and re-read the thread. The concern has been over canonically equivalent identifier strings that risk being falsely rejected as non-matching when implementations perform bytewise comparison rather than Unicode-equivalent string comparison.

Yes, it's clear your interventions reveal your thinking that if you can turn this into an input method issue, then nothing needs to be done about it (at least from the W3C perspective). I and others have pointed out problems with that line of thinking, including that (1) Unicode says nothing at all about requirements for input methods to produce normalized character sequences; and (2) even if Unicode required input methods to produce normalized character sequences, Unicode still promotes two different normalization forms, each of which will fail to match the other under bytewise string comparison.

> If that's a solved problem, great!

It may solve the problem you're trying to twist the discussion toward, but it doesn't come close to solving the problem on which this thread has focussed.

> I realize that the CSS WG doesn't work by the HTML Design
> Principles, but since a decision on the CSS side would leak to HTML
> IDs and class names, I'm looking at this from the point of view of
> the impact on HTML IDs and class names in the light of the HTML
> Design Principles. The Support World Languages and Solve Real
> Problems principles would apply.
>
> Unicode Normalization in the abstract is not a Real Problem. It
> *may* be a *solution* to Real Problems.
>
> The i18n principle is Support World Languages--not Use All Unicode
> Algorithms.

Obviously I wasn't suggesting we need to implement all Unicode algorithms. However, I am suggesting that calling an implementation a Unicode implementation when it treats Unicode-equivalent strings as non-equivalent is messed up. The discussion, moreover, concerns two strings that vary only in the order of their grapheme extenders, where the different order implies no semantic difference and involves no visual distinction. Unicode already prescribes a solution to this problem: comparison of normalized character sequences rather than bytewise comparison of character sequences. If you have a better solution to that problem than what Unicode already recommends, that's great. Share it with us.

> Thus, we should see if there are world languages whose users
> *actually* face problems with the way IDs and class names are
> matched. If we find none, there's nothing to fix and we can stop
> even if Unicode allows such cases to be constructed.

Richard Ishida, Jonathan Kew, and many others have all demonstrated cases where such problems would occur (even with the Mac OS X keyboard input method).
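To make the failure mode concrete, here is a minimal sketch in Python (the q + U+0323/U+0307 pair is the textbook combining-mark example from UAX #15, not one drawn from those demonstrations, and the variable names are hypothetical):

    import unicodedata

    # Two canonically equivalent identifiers: the same base letter with a
    # combining dot below (U+0323) and a combining dot above (U+0307),
    # entered in opposite orders. They render identically.
    id_in_stylesheet = "q\u0323\u0307"  # q + dot below + dot above
    id_in_document   = "q\u0307\u0323"  # q + dot above + dot below

    # Bytewise (code point by code point) comparison falsely reports a mismatch:
    print(id_in_stylesheet == id_in_document)  # False

    # Comparing normalized sequences reports the match Unicode says exists:
    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(id_in_stylesheet) == nfc(id_in_document))  # True

NFD would serve equally well here; what matters is that both sides of the comparison are normalized to the same form before the bytewise check happens.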
Your own example that the fl and fi ligatures are on the Mac OS X keyboard is a great example of a case where keyboard input methods, more than 15 years after the introduction of Unicode, still do not support proper keyboard input (i.e., no one in an era of Unicode text and OpenType and AAT fonts should be entering ligatures into a document from anywhere, much less from the keyboard).

> If there are some, then we should consider how those actual problems
> are best addressed. (It's not a given that the best solution is
> changing the consumer side.)
> In particular, we shouldn't assume a late-normalization hammer and
> start constructing (perhaps using Character Palette on OS X) nails
> that we can hit with it.

I can't imagine how you can say we're trying to tailor the problem to the normalization solution. The companion page I created[1] lists many different solutions to the issue we've been discussing, including new norms to guide input method implementors. So I don't know how you can claim that I have been trying to turn this into an NFC normalization problem (I propose other normalization approaches there too). You yourself have insisted that NFC normalization from input methods is the only thing that makes sense, while I have suggested that other normalization forms might be a more precise approach to this issue (and might lead to better-performing implementations that address it). However, with 15 years of Unicode 1.1 through 5.1 behind us and no such requirements ever directed at Unicode input methods, it's hard to imagine how to handle this issue without some consumer-level approach. If you have some genuine ideas that aren't just repeating how Mac OS X's keyboard input solves everything, I'd be happy to listen. But simply saying it's all the fault and responsibility of input method implementors makes no sense to me.

On Feb 11, 2009, at 3:19 AM, Anne van Kesteren wrote:

> On Wed, 11 Feb 2009 09:12:55 +0100, Ambrose Li <ambrose.li@gmail.com>
> wrote:
>> Pardon my ignorance too, but this is complete news to me. As far as
>> I can tell the discussion was not "revolved around" input methods
>> at all. IME was part of the discussion, but in no way was the focus.
>
> As far as I can tell Henri is right. The reason the i18n WG wants this
> solved on the user agent side is because the authoring side is
> inconsistent in choosing a particular Unicode Normalization Form.

If that's all Henri is saying, I don't think we have much disagreement. Yes, after 15 years or more of Unicode providing no clear norms for input methods and content-producing tools to normalize character sequences in a unified manner, it is necessary to do so on the consumer side. I wasn't sure Henri understood that.

However, the issue goes even further than that, since Unicode (and XML and CSS) simultaneously supports transcoding from any other character set encoding. That means that as a parser converts other character streams into UCS code points, those too may end up in a different, likely non-canonical, order. In fact, the web needs to support situations where a CSS document is in an ISO encoding and an HTML document is in a UTF encoding, and this introduces another place for bytewise comparisons to break. There's no way I can imagine to control all of this at the input method layer. So if that is the case, we need to look at consumer-side normalization (of some form) to address the issue (unless there's some other solution I'm forgetting).
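To illustrate the transcoding point with a sketch along the same lines (the class name "détail" and both documents are hypothetical):

    import unicodedata

    # A class name in a stylesheet saved as ISO-8859-1: "détail" with the
    # precomposed é. Byte 0xE9 transcodes to the single code point U+00E9.
    css_name = b"d\xe9tail".decode("iso-8859-1")

    # The same class name in a UTF-8 document, entered as a decomposed
    # sequence: e (U+0065) followed by combining acute accent (U+0301).
    html_name = "de\u0301tail".encode("utf-8").decode("utf-8")

    print(css_name == html_name)  # False: the code point sequences differ

    nfc = lambda s: unicodedata.normalize("NFC", s)
    print(nfc(css_name) == nfc(html_name))  # True: canonically equivalent

No input method is involved in that mismatch at all; it arises purely from transcoding and authoring choices, which is why the consumer side is the one place the comparison can be made reliable.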
Take care,
Rob

[1]: <http://esw.w3.org/topic/I18N/CanonicalNormalizationIssues?action=show#head-f95f8528a2cf87256a33bd3042b3ae595a0bf5e5>
Received on Wednesday, 11 February 2009 09:55:45 UTC