- From: Robert J Burns <rob@robburns.com>
- Date: Fri, 6 Feb 2009 13:51:20 -0600
- To: benjo316@gmail.com
- Cc: Jonathan Kew <jonathan@jfkew.plus.com>, Anne van Kesteren <annevk@opera.com>, Aryeh Gregor <Simetrical+w3c@gmail.com>, public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
- Message-Id: <046E8B6D-BB17-4B41-8195-6C84EBBD2734@robburns.com>
Benjamin,

On Feb 5, 2009, at 8:44 PM, Benjamin wrote:

> If the various key strokes that can produce ü on European text input
> methods produced different code point sequences, it would be rightly
> considered a defect in the input methods.

You keep bringing up the great superiority of European text input systems, but you're being way too presumptuous about this. I have yet to encounter an input system that entirely avoids non-NFC input (or non-NFD input, if you prefer). Also, as has been raised repeatedly, there are no normative standards that require this one way or the other. Unicode addresses the comparison of canonically equivalent strings; nowhere does it address the input of canonical strings.

I agree with you that input systems should be addressed in the future. However, even once that happens, it still does not necessarily solve the problem of comparing canonically equivalent strings, since one of the strings being compared might be NFC and the other NFD.

> On the other hand, very
> complex input cooking between keystrokes and the code points
> communicated to an application accepting text input already exist:
> consider cooking many Romaji keystrokes into one Kanji.

But this has nothing to do with canonical normalization. No one here has said that normalization cannot be done on input; there is simply no authority out there requiring it. And again, even if input systems were required to normalize, that wouldn't fix the issue of comparing canonically equivalent strings coming from differently normalized input systems. This could get settled in favor of NFC or NFD, but that could take years or decades, if it ever happens.

> If input methods for the languages you mention are inconsistent in
> their ordering of combining marks where the ordering of marks is
> visually indistinguishable, that's a defect of those input methods.

This is not only about combining marks. Two strings can be different representations of the same canonically equivalent string and have no combining marks at all: U+212B ANGSTROM SIGN and U+00C5 Å, for example, are canonically equivalent singletons. And how can you call this a defect of those input systems when there are no authoritative norms telling them to produce one normalization form or another? Shipping a parser that is supposed to conform to Unicode, yet reports two canonically equivalent strings as distinct, looks like the pot calling the kettle black.

> Cooking the input to yield canonically ordered output should be a
> minor feat considering the infrastructure that already exists for e.g.
> Japanese text input methods and the visual rendering integration that
> e.g. Mac OS X does when you type the umlaut and the u separately for
> ü.

Yet Mac OS X ships input systems that will gladly produce non-NFC and non-NFD strings, so where is an example of an operating environment where this has been solved at the input-system level?

> The right place to fix is the input methods--not software further
> down the chain processing text. After all, text is written fewer times
> than it is read. Furthermore, if you count GUI environments that
> handle text input, the number of systems where the fix needs to be
> applied is relatively small--just like the number of browser engines
> is relatively small, which is often used as an argument for putting
> complexity into browser engines.

Wouldn't this also suggest that HTML parsers should stop trying to repair broken HTML source? After all, it would be far better for HTML producers to simply produce correct code than to slow down parsing to fix all those errors.
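To make the comparison problem concrete, here is a minimal sketch (Python and its standard unicodedata module, used purely for illustration) of the point that a consumer has to normalize before comparing, whatever the input systems happened to produce:

    import unicodedata

    # Two canonically equivalent encodings of the same text "ü":
    # the precomposed code point U+00FC, and "u" followed by
    # U+0308 COMBINING DIAERESIS.
    precomposed = "\u00fc"
    decomposed = "u\u0308"

    # A naive code point comparison reports the strings as distinct...
    print(precomposed == decomposed)  # False

    # ...but normalizing both sides to the same form first (NFC here;
    # NFD would work just as well) reports them as equivalent.
    def canonically_equal(a, b):
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print(canonically_equal(precomposed, decomposed))  # True

Whether NFC or NFD wins is beside the point; either way, the comparison has to be performed on normalized strings.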
And in the canonical normalization case, the variant encodings are not even errors by any stretch of the imagination. Unicode doesn't require text to be normalized upon input. What Unicode does address is the concept of canonical equivalence and the normalization of strings for comparison. Take the example I gave before:

1) Ệ (U+1EC6)
2) Ê (U+00CA) (U+0323)
3) Ẹ (U+1EB8) (U+0302)
4) E (U+0045) (U+0323) (U+0302)
5) E (U+0045) (U+0302) (U+0323)

Only one of these is NFC. Only one of these is NFD. Yet I can produce several of these representations of the same canonically equivalent string on Mac OS X. So where is the stellar example of an input system where this is already fixed?

Take care,
Rob
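P.S. A quick way to check the claim mechanically (again a purely illustrative Python sketch; note that unicodedata.is_normalized only exists in Python 3.8 and later):

    import unicodedata

    # The five representations listed above, as escaped code points.
    forms = [
        "\u1EC6",              # 1) precomposed
        "\u00CA\u0323",        # 2) Ê + combining dot below
        "\u1EB8\u0302",        # 3) Ẹ + combining circumflex
        "\u0045\u0323\u0302",  # 4) E + dot below + circumflex
        "\u0045\u0302\u0323",  # 5) E + circumflex + dot below
    ]

    for i, s in enumerate(forms, 1):
        print(i,
              unicodedata.is_normalized("NFC", s),
              unicodedata.is_normalized("NFD", s))
    # Only form 1 prints True for NFC; only form 4 prints True for NFD.

    # All five collapse to a single NFC string, i.e. they are all
    # canonically equivalent representations of the same text.
    print(len({unicodedata.normalize("NFC", s) for s in forms}))  # 1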
Received on Friday, 6 February 2009 19:54:05 UTC