Re: A character is in the eye of the beholder
Prof J Larmouth wrote:
> The discussion on this combining/canonical issue has so far been about
> what the standards currently say, or rather do not say. It would be better
> to focus on what we WANT them to say.
> It would be nice if 10646/Unicode (or some other standard - I don't really
> care) would prescribe precisely what combinations of specified combining and
> non-combining characters were "legal", and to group these so that all
> groups are expected (required?) to produce different screen displays, and
> sequences within a group to produce the same screen display.
I think both 10646 and Unicode say that any combination is "legal";
Unicode is a bit more explicit about it. The equivalence it defines seems
to be what you expect. 10646 also defines "level 1" and "level 2", in
which certain combinations are restricted.
> It would then be a relatively easy matter to specify a canonical sequence
> for each group.
> I would expect that at level 1 there would be zero or one sequence (of a
> single "character") in each group, at level 2 there would be one sequence
> (of a single character) for some groups and multiple (different orders of
> combining characters) sequences for other groups, whilst at level 3
> we would get multiple sequences that differ even apart from order.
> If this provision could be made in some future version of 10646/Unicode,
> then we would have a canonical representation for any "legal" combination of
> 10646/Unicode "characters" which could be used in any comparison software
> that wishes to regard a "character" not as an encoding unit, but as
> something a little closer to normal human usage of the term.
The simplest transformation is to decompose all composites and sort all
combining characters that attach to a single base character in binary
order. This guarantees a unique and permanent canonical representation.
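As a rough illustration of the decompose-and-sort idea, here is a minimal Python sketch. It assumes the canonical decomposition mappings exposed by Python's unicodedata module, and it treats any character with a non-zero combining class as a combining character. The names decompose and canonical_binary_order are made up for this example. Note that it sorts the marks by plain code point value, as described above; it does not claim to match any normalization form defined by the standards.

```python
import unicodedata

def decompose(ch):
    """Recursively expand a character's canonical decomposition, if it has one."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings carry a tag such as "<compat>"; leave those alone.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(decompose(chr(int(cp, 16))) for cp in mapping.split())

def canonical_binary_order(text):
    """Decompose everything, then sort each run of combining marks by code point."""
    out = []
    marks = []  # combining marks attached to the most recent base character
    for ch in "".join(decompose(c) for c in text):
        if unicodedata.combining(ch):
            marks.append(ch)
        else:
            out.extend(sorted(marks))  # flush marks of the previous base
            marks = []
            out.append(ch)
    out.extend(sorted(marks))
    return "".join(out)
```

With this, canonical_binary_order("a\u0323\u0301") and canonical_binary_order("a\u0301\u0323") yield the same string, so a comparison routine can simply compare the transformed sequences.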
An alternative is to replace every combination for which a composite
is defined with that composite. This transformation depends on the version
of the standard, since new composite characters are being discovered from
time to time, but is still satisfactory.
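To show why the composing transformation is tied to a version of the standard, here is a small Python sketch. The function compose and the tables V1 and V2 are hypothetical: they stand in for the composite repertoire of an earlier and a later version and are not taken from any actual edition. The sketch also assumes the input is already decomposed, with each combining character directly following its base.

```python
def compose(text, table):
    """Replace each (previous, next) pair listed in table with its composite."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in table:
            out[-1] = table[(out[-1], ch)]  # merge into the preceding character
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical repertoires: the later "version" knows one more composite.
V1 = {("e", "\u0301"): "\u00e9"}
V2 = dict(V1)
V2[("s", "\u0323")] = "\u1e63"
```

The same input, "s\u0323", is left unchanged under V1 but becomes the single character "\u1e63" under V2, so two parties must agree on the version before comparing composed text.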