Re: A character is in the eye of the beholder
Jonathan Rosenne wrote:
>Prof J Larmouth wrote:
>> If this provision could be made in some future version of 10646/Unicode,
>> then we would have a canonical representation for any "legal" combination of
>> 10646/Unicode "characters" which could be used in any comparison software
>> that wishes to regard a "character" not as an encoding unit, but as
>> something a little closer to normal human usage of the term.
>The simplest transformation is to decompose all composites and sort all
>combining characters that attach to a single base character in binary
>order.
It is a little bit more complicated than that: you are not allowed to
reorder combining characters that go on the same side of the base
character, because their relative order is significant. But the
resulting order is still very well defined.
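
To illustrate the rule, here is a minimal sketch in Python. It assumes
the standard unicodedata module, whose NFD form decomposes composites and
then orders marks by their combining class rather than by a plain binary
sort. The function name canonical_decompose is made up for this example.

```python
import unicodedata


def canonical_decompose(text: str) -> str:
    """Decompose composites and put combining marks in canonical order.

    Marks with different combining classes (roughly: marks on different
    sides of the base) are sorted; marks with the same class are left in
    the order they were written.
    """
    return unicodedata.normalize("NFD", text)


# Acute goes above the base (class 230), dot below goes under it (class 220).
acute_then_dot = "a\u0301\u0323"
dot_then_acute = "a\u0323\u0301"

# Different sides: both spellings end up as the same sequence.
assert canonical_decompose(acute_then_dot) == canonical_decompose(dot_then_acute)

# Acute and diaeresis both go above the base (class 230 for each).
acute_then_diaeresis = "a\u0301\u0308"
diaeresis_then_acute = "a\u0308\u0301"

# Same side: the order is kept, so the two spellings stay distinct.
assert canonical_decompose(acute_then_diaeresis) != canonical_decompose(diaeresis_then_acute)
```

A precomposed input such as U+00E1 followed by U+0323 goes through the
same process: it is decomposed first, and the marks are then ordered.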
>This guarantees a unique and permanent canonical representation.
>An alternative is to replace all those combinations that are defined
>with the composite. This transformation is dependent on the version of
>the standard, since new composite characters are being discovered from
>time to time, but is still satisfactory.
It's not that they are discovered. There are, e.g., many known diacritic
combinations in Hebrew, and except for a very few cases used in Yiddish,
none of them is, or (hopefully) ever will be, encoded as a precomposed
character. The problem therefore arises not when such combinations are
newly discovered, but when they are added to the standard.
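
As a rough illustration of the two transformations, here is a sketch in
Python. It assumes the standard unicodedata module and the normalization
forms as Unicode later defined them (NFD and NFC), which are not part of
this discussion. The helper names are made up for the example.

```python
import unicodedata

# One of the few Yiddish cases: yod with hiriq.
YOD_HIRIQ_SEQUENCE = "\u05D9\u05B4"   # HEBREW LETTER YOD + HEBREW POINT HIRIQ
YOD_HIRIQ_PRECOMPOSED = "\uFB1D"      # HEBREW LETTER YOD WITH HIRIQ


def to_decomposed(text: str) -> str:
    """First transformation: decompose every composite."""
    return unicodedata.normalize("NFD", text)


def to_composed(text: str) -> str:
    """Second transformation: use composites where the standard allows it."""
    return unicodedata.normalize("NFC", text)


# The decomposed form does not care whether a precomposed character exists.
assert to_decomposed(YOD_HIRIQ_PRECOMPOSED) == YOD_HIRIQ_SEQUENCE
assert to_decomposed(YOD_HIRIQ_SEQUENCE) == YOD_HIRIQ_SEQUENCE

# The composed form depends on the tables of the standard in use.
# Latin e + acute has a composite that NFC uses ...
assert to_composed("e\u0301") == "\u00E9"
# ... while the Hebrew sequence stays a sequence under NFC.
assert to_composed(YOD_HIRIQ_SEQUENCE) == YOD_HIRIQ_SEQUENCE

# Version of the Unicode data this Python build was compiled against.
unicode_version = unicodedata.unidata_version
```

The decomposed result is the same whichever composites a given version
happens to define, which is why it is the safer basis for comparison.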