Re: Unicode Normalization

Henri,

On Feb 5, 2009, at 8:44 PM, Benjamin wrote:

> If the various key strokes that can produce ü on European text input
> methods produced different code point sequences, it would be rightly
> considered a defect in the input methods.

You keep bringing up the great superiority of European text input
systems, but you're being way too presumptuous about this. I have yet
to encounter an input system that entirely avoids non-NFC input (or
non-NFD input, if you prefer). Also, as has been raised repeatedly,
there are no normative standards that require this one way or the other.
Unicode addresses the comparison of canonical strings. Nowhere does it
address the input of canonical strings. I agree with you that input
systems should be addressed in the future. However, even once that
happens, it still does not necessarily solve the problem of comparing
canonically equivalent strings, since one of the strings being compared
might be NFC and the other NFD.
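
To make that concrete, here is a minimal Python sketch (the
standard-library unicodedata module is just one illustration; any
conformant normalizer would behave the same way):

    import unicodedata

    nfc = "\u00FC"     # "ü" as one precomposed code point (NFC form)
    nfd = "u\u0308"    # "ü" as "u" plus combining diaeresis (NFD form)

    # Code-point-by-code-point comparison treats the two as distinct...
    print(nfc == nfd)                             # False

    # ...while comparison after normalizing to a common form does not.
    print(unicodedata.normalize("NFC", nfc) ==
          unicodedata.normalize("NFC", nfd))      # True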

> On the other hand, very
> complex input cooking between keystrokes and the code points
> communicated to an application accepting text input already exist:
> consider cooking many Romaji keystrokes into one Kanji.

But this has nothing to do with canonical normalization. No one here
has said that normalization cannot be done on input; however, there's
no authority out there requiring it. And again, even if input systems
were required to normalize, this wouldn't fix the issue of comparing
canonically equivalent strings produced by differently normalizing
input systems. This could get settled in favor of NFC or NFD, but that
could take years or decades, if it ever happened.

>
> If input methods for the languages you mention are inconsistent in
> their ordering of combining marks where the ordering of marks is
> visually indistinguishable, that's a defect of those input methods.

This is not only about combining marks. Two strings can be different
representations of the same canonically equivalent text and contain no
combining marks at all. And how can you say it is a defect of those
input systems when there are no authoritative norms telling those
systems to produce one normalized form or another (while at the same
time producing a parser that is supposed to conform to Unicode yet,
when comparing two canonically equivalent strings, reports them as
distinct; this looks like the pot calling the kettle black)?

> Cooking the input to yield canonically ordered output should be a
> minor feat considering the infrastructure that already exists for e.g.
> Japanese text input methods and the visual rendering integration that
> e.g. Mac OS X does when you type the umlaut and the u separately for
> ü.

Yet Mac OS X has input systems that will gladly produce non-NFC and  
non-NFD strings, so where's an example of an operating environment  
where this is solved in terms of input systems?

> The right place to fix is the input methods--not software further
> down the chain processing text. After all, text is written fewer times
> than it is read. Furthermore, if you count GUI environments that
> handle text input, the number of systems where the fix needs to be
> applied is relatively small--just like the number of browser engines
> is relatively small, which is often used as an argument for putting
> complexity into browser engines.

Wouldn't this also suggest that HTML parsers should stop trying to
repair broken HTML source? After all, it would be far better for HTML
producers to simply produce correct code than for parsers to slow down
and fix all those errors. In the canonical normalization case, these
are not even errors by any stretch of the imagination. Unicode doesn't
require text to be normalized upon input. What Unicode does address is
the concept of canonical equivalence and the normalization of strings
for comparison.

Take the example I gave before:

1) Ệ (U+1EC6)
2) Ê (U+00CA) (U+0323)
3) Ẹ (U+1EB8) (U+0302)
4) E (U+0045) (U+0323) (U+0302)
5) E (U+0045) (U+0302) (U+0323)

Only one of these is NFC. Only one of these is NFD. However, I can
produce several of these representations of the same canonically
equivalent string on Mac OS X. Where's the stellar input-system
example where this is already fixed?
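
For what it's worth, any conformant normalizer collapses all five of
those sequences to a single NFC form and a single NFD form; roughly,
in Python (a sketch of the comparison step only, not a claim about any
particular input system):

    import unicodedata

    forms = [
        "\u1EC6",               # 1) precomposed
        "\u00CA\u0323",         # 2) E-circumflex + combining dot below
        "\u1EB8\u0302",         # 3) E-dot-below + combining circumflex
        "\u0045\u0323\u0302",   # 4) decomposed, canonical order (NFD)
        "\u0045\u0302\u0323",   # 5) decomposed, non-canonical order
    ]

    # Five distinct code-point sequences...
    print(len(set(forms)))                                        # 5

    # ...but one string once normalized to either form.
    print(len({unicodedata.normalize("NFC", s) for s in forms}))  # 1
    print(len({unicodedata.normalize("NFD", s) for s in forms}))  # 1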

Take care,
Rob

Received on Friday, 6 February 2009 19:54:04 UTC