Re: Internationalized CLASS attributes
Martin J Duerst writes:
> Keld Simonsen wrote:
> >Martin J Duerst writes:
> >> Keld Simonsen wrote:
> >> >Martin J Duerst writes:
> >> Does it say explicitly that an application is forbidden to
> >> treat the two representations as equivalent, or to normalize
> >> to one or the other? Or does it say that a system is forbidden
> >> (on level 3) to use the sequence of LATIN CAPITAL LETTER A and
> >> COMBINING GRAVE ACCENT? If yes, can you tell me in which chapter
> >> (please no page numbers, I only have a Japanese translation)
> >> it says so?
> >No, 10646 does not forbid that. But it is not defined in 10646, and you
> >can do it in an application. But why not just do as the standard
> >prescribes to encode a character, when you mean that character?
> >Then you are following international standards.
> Well, somebody may encode A-with-GRAVE, because (s)he sees it as
> a single character, and it appears as such on the keyboard.
> And somebody else may encode A followed by GRAVE, because
> the GRAVE is on a separate key, and e.g. as a tone can go on
> any vowel (of course, what the user enters and what the system
> does may well be two different things). And strictly by ISO 10646,
> these might be two different things. ISO 10646 does not prescribe
> that an A followed by a combining GRAVE is illegal, or should not
> be used, just because A-with-GRAVE exists as a separate codepoint.
I don't think it is the user who normally encodes things; it is the
designers of the system. What you describe here is the way
characters are typed in, and that is quite different from how they
are represented internally. For example, the A followed by GRAVE is
normally typed on Latin keyboards by *first* entering a
dead key "GRAVE" and then the A. The input system needs to combine
this into an A-GRAVE or, as you suggest, into *first* an A and then
a combining grave; that is, it has to intelligently reverse the order
of the base letter and the accent. This is not a user decision but
a matter of system design. The system designer may then choose
to convert the input stream into something which has an ambiguous
meaning according to the ISO standards, for example with combining
characters, or code it as a character in the 10646 repertoire, which
then also has well-defined properties and can be processed, for
example in sorting and character set conversion, according to other
formally standardized specifications.
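
To make the dead-key point concrete, here is a minimal sketch in Python
of an input layer that receives the dead key first and the base letter
second, and then stores either choice. The DEAD_KEYS table and the
function compose_dead_key are hypothetical names made up for this
illustration, and the use of Unicode normalization form NFC to find the
whole character is my assumption; NFC is defined by Unicode, not by
10646 itself.

```python
import unicodedata

# Hypothetical dead-key table (an assumption for this sketch):
# dead key name -> combining character.
DEAD_KEYS = {
    "GRAVE": "\u0300",  # COMBINING GRAVE ACCENT
    "ACUTE": "\u0301",  # COMBINING ACUTE ACCENT
}


def compose_dead_key(dead_key, base, precomposed=True):
    """Turn a dead-key press followed by a base letter into stored text."""
    # The user typed accent first, letter second; the stored order is
    # base letter first, combining mark second.
    sequence = base + DEAD_KEYS[dead_key]
    if precomposed:
        # Assumption: NFC is used to pick the single whole character
        # where one exists.
        return unicodedata.normalize("NFC", sequence)
    return sequence


whole = compose_dead_key("GRAVE", "A")
split = compose_dead_key("GRAVE", "A", precomposed=False)
print(unicodedata.name(whole))
print([unicodedata.name(ch) for ch in split])
```

Either way the user's keystrokes are identical; only the system
designer's choice of the precomposed flag decides what gets stored.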
> >Well, if you have a 10646 character there is only one way
> >to encode it. There is no decomposition in 10646.
> A user does not have or see ISO 10646 characters. A user
> sees and deals with things on the screen and on paper.
> ISO 10646 characters are abstract entities, and we have
> to make sure, where possible, that the application takes
> provisions to reconcile these abstract entities with the
> expectations of the user if the expectations of the user
> are different.
I think that is very hard to do. How can you find out what
a user perceives a character to be? On the Latin keyboards I know
of, you often have dead keys to enter accented characters,
so whether the user perceives this as two characters or perceives
it as one character, it needs to be keyed in the same way.
I find it much more relevant that the system codes the
information in one unambiguous way, and that is the responsibility of the
system designer of the keyboard interface, in conjunction with the
designers of the rest of the system.
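
As a small illustration of why one unambiguous coding matters, the
Python sketch below compares the two representations directly and again
after forcing both into a single form. The helper to_single_form is a
hypothetical name, and taking Unicode's NFC as "the one form" is an
assumption on my part, not something 10646 prescribes.

```python
import unicodedata


def to_single_form(text):
    # Assumption: NFC stands in for "the one unambiguous coding".
    return unicodedata.normalize("NFC", text)


precomposed = "\u00C0"   # LATIN CAPITAL LETTER A WITH GRAVE
decomposed = "A\u0300"   # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT

# Compared code by code, the two representations differ.
raw_equal = precomposed == decomposed

# Once the system stores only one form, they compare equal.
stored_equal = to_single_form(precomposed) == to_single_form(decomposed)

print(raw_equal, stored_equal)
```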
> >> For the above example, imagine a tonal language such as
> >> Chinese. For many applications, it may be more convenient
> >> to be able to detach tone accents by removing characters
> >> than to do conversions from one codepoint to another.
> >Could it not just as conveniently be handled with the ordinary
> >characters of 10646?
> Combining characters are 10646 characters too, and indispensable
> for some languages. Calling everything else "ordinary" is not
> very friendly to these languages.
> Also, in some cases, no precombined characters are available,
> so "just handling it with 'ordinary'" characters is not possible.
I agree that for some scripts, you need combining characters.
But for almost all Latin-based languages, you have all you
need in the form of whole characters in 10646. There are a few
examples of Latin letters that are not encoded in 10646, and for those
the only way to represent that information is with
the use of combining characters, agreed. But the occurrences of those
combinations would be very minimal compared to what can be coded
directly in 10646.
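
One rough way to check whether a given letter-plus-accent combination
is available as a whole character is sketched below in Python. The
function has_precomposed is a hypothetical helper, and it relies on
Unicode's NFC composition rather than on anything in 10646 itself, so
take it as an approximation only.

```python
import unicodedata


def has_precomposed(base, combining_mark):
    """Report whether base + mark composes to a single character under NFC."""
    composed = unicodedata.normalize("NFC", base + combining_mark)
    # Caveat: a few encoded characters are excluded from NFC composition,
    # so this can under-report; for Latin letters it is a fair test.
    return len(composed) == 1


GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

print(has_precomposed("A", GRAVE))  # A WITH GRAVE is a whole character
print(has_precomposed("q", GRAVE))  # no whole character for q with grave
```

For the second case the combining sequence is the only way to carry
the information, which is the minimal remainder I mean above.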