
Re: Internationalized CLASS attributes



Keld Simonsen wrote:

>Martin J Duerst writes:

>> Even if character semantics is missing from ISO 10646, combining
>> characters are clearly defined and mentioned, because of their
>> relevance with respect to implementation levels, and for example
>> in Appendix B.
>>
>> A theoretical interpretation (which Keld seems to be taking) could
>> say that because ISO 10646 does not say that
>> 	LATIN CAPITAL LETTER A WITH GRAVE
>> and the sequence of
>> 	LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
>> are equivalent, and because it calls all three of them graphic characters,
>> the two things are different, and there is no ambiguity.
>
>Well, 10646 says that 
> 	LATIN CAPITAL LETTER A WITH GRAVE
>and the sequence of
> 	LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
>are not the same character, as the first is only one character
>and the latter is two characters. This is not a theoretical 
>interpretation of 10646, but what the standard says.

Does it say explicitly that an application is forbidden to
treat the two representations as equivalent, or to normalize
to one or the other? Or does it say that a system is forbidden
(on level 3) to use the sequence of LATIN CAPITAL LETTER A and
COMBINING GRAVE ACCENT? If yes, can you tell me in which chapter
(please no page numbers, I only have a Japanese translation)
it says so?
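
To make the distinction concrete, here is a minimal sketch of "different as coded character sequences, but treated as equivalent by the application". It assumes Python's standard unicodedata module and the Unicode normalization forms NFC and NFD, which are not part of ISO 10646 itself; the helper name `equivalent` is only illustrative.

```python
import unicodedata

# One coded character: LATIN CAPITAL LETTER A WITH GRAVE
precomposed = "\u00C0"
# Two coded characters: LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT
decomposed = "A\u0300"

def equivalent(a, b):
    """Compare two strings after normalizing both to the composed form (NFC).

    Assumption: the application has chosen NFC as its single internal
    representation; NFD would work just as well for the comparison.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# As raw sequences the two differ, in content and in length ...
raw_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))
# ... but an application that normalizes sees them as the same text.
normalized_equal = equivalent(precomposed, decomposed)
```

Nothing here contradicts the statement that the first is one character and the second is two; the equivalence is a decision made by the application on top of the coded representation.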

>> Such an interpretation may not conflict with ISO 10646, but it clearly
>> does not help any user. ISO 10646 also does not prohibit collapsing
>> these two representations for the benefit of the user.
>
>I would rather say that for the benefit of the user you
>should only encode a character in one way, and that is the encoding
>of 10646. You should not engage in artificial decomposition
>of characters, that only complicates things.

I agree that a system should encode each character in only one way.
But the way you put it already suggests that there is more than
one way. Also, where in one system or language using precomposed
characters is the natural way to do things, in another system
using decomposition may be the natural way to do things.

For the above example, imagine a tonal language such as
Chinese. For many applications, it may be more convenient
to be able to detach tone accents by removing characters
than to do conversions from one codepoint to another.
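
As a rough sketch of what "detaching tone accents by removing characters" could look like, assume romanized Chinese (Pinyin), Python's unicodedata module, and Unicode's NFD/NFC normalization forms. The function name and the choice of exactly four tone marks are assumptions made for illustration.

```python
import unicodedata

# Combining marks used as Pinyin tone marks (an assumption for this sketch):
# macron, acute accent, caron, grave accent.
TONE_MARKS = {"\u0304", "\u0301", "\u030C", "\u0300"}

def strip_tone_marks(text):
    """Remove tone marks by deleting combining characters.

    The text is first fully decomposed, the tone marks are dropped,
    and the rest is recomposed. Other marks, such as the diaeresis
    on u, are kept because they are not in TONE_MARKS.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)

plain = strip_tone_marks("n\u01D0 h\u01CEo")   # "ni hao" with third-tone marks
kept_diaeresis = strip_tone_marks("l\u01DC")   # u with diaeresis and grave
```

With precomposed code points only, the same operation needs a conversion table with one entry per accented letter; with the decomposed form it is a plain deletion.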

Also, for an application that has to use combining characters
for languages and special applications that don't have all their
combinations precomposed in ISO 10646, it may be much more
straightforward to have everything composing, and no precomposed
codepoints. Software has to deal with composition anyway if
it deals with pointed Arabic and Hebrew, or with Indic languages.
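
The point about combinations that have no precomposed code point can be checked with a small sketch. It assumes Python's unicodedata module and uses Unicode's NFC composition as a stand-in for "is there a precomposed character"; the helper name is hypothetical.

```python
import unicodedata

def has_precomposed_form(base, mark):
    """Return True if base + combining mark composes to a single code point.

    Assumption: NFC composition is taken as the test; a sequence that
    stays two code points long after NFC has no usable precomposed form.
    """
    composed = unicodedata.normalize("NFC", base + mark)
    return len(composed) == 1

a_grave = has_precomposed_form("A", "\u0300")  # A WITH GRAVE exists
q_grave = has_precomposed_form("q", "\u0300")  # no precomposed q with grave
```

Software that must handle the second case needs combining sequences regardless of how many precomposed characters are available for the first.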

Also, actual implementation experience and recent discussions
show that the effort to deal with composition is mostly
overestimated.

Regards,	Martin.

