A character is in the eye of the beholder

=========================================================================
     Prof J Larmouth,  University Director of Telematic Applications,
     IT Institute,  University of Salford,  Salford M5 4WT,  England.

J.Larmouth @ ITI.SALFORD.AC.UK                Telephone: +44 161 745 5657
                                                    Fax: +44 161 745 8169
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

To:     www-international@w3.org

Subject:      A character is in the eye of the beholder

Martin Duerst wrote:

>Well, somebody may encode A-with-GRAVE, because (s)he sees it as
>a single character, and it appears as such on the keyboard.
>And somebody else may encode A followed by GRAVE, because
>the GRAVE is on a separate key ...
>
>... And whether a user means "that character" or is more familiar to
>think about it as two different things is important for the local
>user interface, but as both of these things appear (or should appear)
>in the same way on the screen, even if ISO 10646 does not specify
>any equivalence, it makes sense to specify equivalences on the
>application level.

I think this is a key remark.  We see here two views of the same thing:  on
input,  two key presses (two "characters"),  and on output,  a single glyph
(one "character").

The discussion on this combining/canonical issue has so far been about
what the standards currently say,  or rather do not say.  It would be better
to focus on what we WANT them to say.

It would be nice if 10646/Unicode (or some other standard - I don't really
care) would prescribe precisely which combinations of specified combining and
non-combining characters are "legal",  and would group these so that different
groups are expected (required?) to produce different screen displays,  and
all sequences within a group to produce the same screen display.

It would then be a relatively easy matter to specify a canonical sequence
for each group.
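
As a rough illustration of the grouping idea,  here is a minimal Python
sketch.  Everything in it is an assumption made for the example:  the two
groups,  the names GROUPS,  canonical_choice and canonicalise,  and the rule
for picking the canonical sequence (fewest code points,  ties broken by code
point order).  No standard is being quoted.

```python
# Hypothetical groups: every sequence in a group is expected to produce
# the same screen display. The contents are illustrative only.
GROUPS = [
    # A with grave: precomposed, or A followed by combining grave.
    ["\u00C0", "A\u0300"],
    # a with circumflex and dot below: precomposed, or either order
    # of the two combining characters.
    ["\u1EAD", "a\u0323\u0302", "a\u0302\u0323"],
]


def canonical_choice(group):
    """Pick one sequence per group (assumed rule: shortest, then lowest)."""
    return min(group, key=lambda seq: (len(seq), seq))


# Map every listed sequence to the canonical sequence of its group.
CANONICAL = {seq: canonical_choice(group) for group in GROUPS for seq in group}


def canonicalise(seq):
    """Return the canonical sequence; unlisted sequences pass through."""
    return CANONICAL.get(seq, seq)
```

The sketch handles one "character" (one sequence) at a time;  splitting a
longer string into such sequences is a separate problem that it does not
address.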

I would expect that at level 1 there would be zero or one sequence (of a
single "character") in each group,  at level 2 there would be one sequence
(of a single character) for some groups and multiple (different orders of
combining characters) sequences for other groups,  whilst at level 3
we would get multiple sequences that differ even apart from order.

If this provision could be made in some future version of 10646/Unicode,
then we would have a canonical representation for any "legal" combination of
10646/Unicode "characters" which could be used in any comparison software
that wishes to regard a "character" not as an encoding unit,  but as
something a little closer to normal human usage of the term.
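
A minimal sketch of what such comparison software might look like,  assuming
Python and using the normalize function of its standard unicodedata module
(with the "NFC" form) as a stand-in for the canonical representation asked
for above.  The function name same_text is made up for illustration,  and the
choice of NFC rather than another form is an assumption.

```python
import unicodedata


def same_text(left, right):
    """Compare two strings after mapping each to one canonical form.

    NFC is used here as an assumed stand-in for the canonical
    representation; it reorders combining characters and composes
    them with their base character where a precomposed form exists.
    """
    return (unicodedata.normalize("NFC", left)
            == unicodedata.normalize("NFC", right))


# A-with-GRAVE typed as one key versus A followed by combining GRAVE.
one_key = "\u00C0"
two_keys = "A\u0300"
print(same_text(one_key, two_keys))

# The same two combining characters entered in different orders.
print(same_text("a\u0323\u0302", "a\u0302\u0323"))
```

A plain comparison of the encoded strings would report both pairs as
different;  comparing the canonical forms treats each pair as the same
"character".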

John L

Received on Monday, 21 October 1996 11:57:19 UTC