Re: Internationalized CLASS attributes from Martin J Duerst on 1996-10-24 (www-international@w3.org from October to December 1996)

From: Martin J Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 24 Oct 1996 10:42:57 +0100 (MET)
To: keld@dkuug.dk (Keld J|rn Simonsen)
Cc: rosenne@NetVision.net.il, www-international@w3.org
Message-ID: <"josef.ifi..482:24.09.96.09.43.03"@ifi.unizh.ch>
Keld Simonsen wrote:

>Martin J Duerst writes:
>
>> Well, somebody may encode A-with-GRAVE, because (s)he sees it as
>> a single character, and it appears as such on the keyboard.
>> And somebody else may encode A followed by GRAVE, because
>> the GRAVE is on a separate key, and e.g. as a tone can go on
>> any vowel (of course, what the user enters and what the system
>> does may well be two different things). And strictly by ISO 10646,
>> these might be two different things. ISO 10646 does not prescribe
>> that an A followed by a combining GRAVE is illegal, or should not
>> be used, just because A-with-GRAVE exists as a separate codepoint.
>
>I don't think it is the user who normally encodes things, it is the
>designers of the system.

I mentioned this fact above in parentheses. I am glad you
expanded on it for the benefit of the readers of this list.

>What you describe here is the way
>characters are typed in, and that is quite different from how they
>are represented internally. For example, the A followed by GRAVE is
>normally typed in on Latin keyboards by *first* entering a
>dead key "GRAVE" and then the A.

Yes, this is popular, because it was that way on typewriters,
for mechanical reasons. But people not used to typewriters
in many cases prefer to type the accent after the base letter,
as I guess most of them write it that way, and think of the
accent as an addition to the base letter.

>The input system needs to combine
>this into an A-GRAVE, or as you suggest, into *first* an A and then
>a combining grave, that is, it intelligently has to reverse the order
>of the base letter and the accent.

Not much intelligence needed. All of that can be done with simple tables.
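
To make the table idea concrete, here is a minimal sketch in Python.
The key names ("GRAVE", "ACUTE"), the table contents, and the function
name compose are all made up for illustration; a real input method
would have far larger tables and its own interface.

```python
# Hypothetical table: (dead key, base letter) -> precomposed character.
# Key names and entries are illustrative only.
DEAD_KEY_TABLE = {
    ("GRAVE", "A"): "\u00C0",  # LATIN CAPITAL LETTER A WITH GRAVE
    ("GRAVE", "a"): "\u00E0",  # LATIN SMALL LETTER A WITH GRAVE
    ("GRAVE", "e"): "\u00E8",  # LATIN SMALL LETTER E WITH GRAVE
    ("ACUTE", "e"): "\u00E9",  # LATIN SMALL LETTER E WITH ACUTE
}

# Hypothetical table: dead key -> combining character (fallback case).
COMBINING = {
    "GRAVE": "\u0300",  # COMBINING GRAVE ACCENT
    "ACUTE": "\u0301",  # COMBINING ACUTE ACCENT
}


def compose(dead_key, base):
    """Turn a dead key followed by a base letter into stored text."""
    precomposed = DEAD_KEY_TABLE.get((dead_key, base))
    if precomposed is not None:
        return precomposed
    # No precomposed entry: store the base letter first and the accent
    # second, which reverses the order in which the keys were typed.
    return base + COMBINING[dead_key]
```

With these tables, compose("GRAVE", "A") yields the single codepoint
U+00C0, while compose("GRAVE", "q") falls back to "q" followed by
U+0300. Both cases are plain lookups.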

>> A user does not have or see ISO 10646 characters. A user
>> sees and deals with things on the screen and on paper.
>> ISO 10646 characters are abstract entities, and we have
>> to make sure, where possible, that the application takes
>> provisions to reconcile these abstract entities with the
>> expectations of the user if the expectations of the user
>> are different.
>
>I think that is very hard to do. How can you find out what
>a user perceives a character to be? On the Latin keyboards I know
>of, you often have dead keys to enter accented characters,
>so whether the user perceives this as two characters or perceives
>it as one character, it needs to be keyed in the same way.
>
>I find that it is much more relevant that the system codes the
>information in one unambiguous way, and that is the responsibility of the
>system designer of the keyboard interface, in conjunction with the
>designers of the rest of the system.

Of course. What I wanted to say above is that although different
users may have different ideas about A-grave, thinking about
it as one or two characters, or as something in between, or probably
not having a very explicit idea about it anyway, every single
user thinks that all the A-graves that (s)he sees on any form of
paper or on any computer screen are the same.

If we have systems which for good reasons encode A-grave as
a single codepoint, and others that use two codepoints for
reasons equally valid in their circumstances, we run the danger
that at some point, a system may say that these two are different
although for the user, they are the same.
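
The danger can be shown in a few lines of Python. This is only a
sketch: it assumes the standard unicodedata module as a stand-in for
whatever equivalence rules a system would have to know about, and the
function name same_for_user is made up for illustration.

```python
import unicodedata

precomposed = "\u00C0"  # A-with-GRAVE as a single codepoint
combining = "A\u0300"   # A followed by COMBINING GRAVE ACCENT


def same_for_user(s, t):
    """Compare two strings after bringing both to one normal form.

    NFC is used here as an assumption; any single agreed form would do.
    """
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)


# A plain codepoint-by-codepoint comparison says the two are different.
plain_equal = precomposed == combining

# A comparison that knows about the equivalence says they are the same.
user_equal = same_for_user(precomposed, combining)
```

Here plain_equal is False and user_equal is True, although both
strings are the same A-grave on the screen.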

This seems to be less of a problem if we import text from one
tool (e.g. editor) to another, because we can reasonably
assume that a good Unicode/ISO10646 editor will normalize
the A-grave one way or another (which way is not relevant),
so that the user gets consistent behaviour.

We run into more problems if we go to applications such as
compilers, interpreters, and formatters (as with HTML, where
the current discussion started), which concentrate on other
things and would rather not deal with character equivalence.
We run into even more problems if ISO10646 is used for
external identification, where for performance reasons
we want very quick comparisons that don't need normalization.
In order to make this work, we have to "prenormalize", and
for this, we have to decide on doing it one way or another,
for each character or combination in question.
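
As a sketch of what "prenormalizing" could look like, assume that the
agreed form is what Python's unicodedata module calls NFC (the choice
is an assumption, the point is only that one form must be fixed). The
names prenormalize and SymbolTable are hypothetical.

```python
import unicodedata


def prenormalize(identifier):
    """Bring an identifier to the agreed form, once, where it enters.

    Assumption: NFC is the agreed form. Which form is picked matters
    less than picking exactly one.
    """
    return unicodedata.normalize("NFC", identifier)


class SymbolTable:
    """Hypothetical identifier table that expects prenormalized keys."""

    def __init__(self):
        self._symbols = {}

    def define(self, raw_name, value):
        # Normalization happens only here, at the boundary.
        self._symbols[prenormalize(raw_name)] = value

    def lookup(self, normalized_name):
        # Quick path: a plain lookup, with no normalization at all.
        return self._symbols.get(normalized_name)


table = SymbolTable()
table.define("VOIL\u00C0", 1)             # defined with one codepoint
key = prenormalize("VOILA\u0300")         # typed elsewhere with two
found = table.lookup(key)                 # matches the definition
missed = table.lookup("VOILA\u0300")      # not prenormalized: no match
```

The lookup itself stays a plain comparison. The price is that every
producer of identifiers has to apply the same prenormalization, which
is why the form must be decided on in advance.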

 
>I agree that for some scripts, you need combining characters.
>But for almost all of the Latin-based languages, you have all you
>need in the form of whole characters in 10646. There are a few
>examples of Latin letters that are not encoded in 10646, and for those
>the only way to represent that information is with
>the use of combining characters, agreed. But the occurrences of those
>combinations would be very minimal compared to what can be coded
>directly in 10646.

The important words here are "almost all" and "minimal". Some
people believe that this can be changed to "all" and "none",
just by adding more precombined characters. The fact is that
it cannot be done: there are several thousand languages
written with the Latin script, and linguists invent new
combinations according to their needs. The addition of
new combinations, however, has the undesired effect of
further marginalizing those languages that need combining
characters, leading to a vicious cycle.
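
A small illustration of "almost all" versus "all", again assuming
Python's unicodedata module and its NFC form as the composing step.
The letter q with a grave accent is just an example of a combination
for which no precombined codepoint exists.

```python
import unicodedata

# A combination that has a precombined codepoint: composes to one.
a_grave = unicodedata.normalize("NFC", "A\u0300")

# A combination that has none: it stays as base plus combining accent.
q_grave = unicodedata.normalize("NFC", "q\u0300")

a_grave_length = len(a_grave)
q_grave_length = len(q_grave)
```

Here a_grave_length is 1 and q_grave_length is 2, so software that
only expects precombined characters handles the first and stumbles
over the second.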


Regards,	Martin.
Received on Thursday, 24 October 1996 04:43:48 UTC