
Re: Internationalized CLASS attributes



Martin J Duerst writes:

> Keld Simonsen wrote:
> 
> >Jonathan Rosenne writes:
> >
> >> But there is another problem with internationalized names: UCS defines a
> >> non-unique coding. Some composite characters have at least two valid
> >> representations, the composed character and the base character followed
> >> by diacritics. If there is more than one diacritic, their order is not
> >> defined. The user often has no control over the coding. So before using
> >> a name, it must be brought to a canonical representation.
> >
> >Well, UCS (=ISO/IEC 10646) does not define ambiguous encoding
> >of characters, but Unicode does. Fortunately, HTML is defined in
> >terms of ISO/IEC 10646.
> 
> ISO 10646 does not define character semantics, and says nothing about
> what combinations of codepoints should reasonably be treated as the
> same characters on the application level and for the user.

Yes, true. This is part of what I meant when I said that 10646
does not define ambiguous encodings of characters. A character can only
be coded in one way, which makes things simpler, as it removes the
problem of multiple encodings of one character, as found in Unicode.

> Even if character semantics is missing from ISO 10646, combining
> characters are clearly defined and mentioned, because of their
> relevance with respect to implementation levels, and for example
> in Appendix B.
> 
> A theoretical interpretation (which Keld seems to be taking) could
> say that because ISO 10646 does not say that
> 	LATIN CAPITAL LETTER A WITH GRAVE
> and the sequence of
> 	LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
> are equivalent, and because it calls all three of them graphic characters,
> the two things are different, and there is no ambiguity.

Well, 10646 says that 
 	LATIN CAPITAL LETTER A WITH GRAVE
and the sequence of
 	LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
are not the same character, as the first is only one character
and the latter is two characters. This is not a theoretical 
interpretation of 10646, but what the standard says.
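
To make the distinction concrete, here is a minimal Python sketch. It is
only an illustration, not anything taken from 10646 itself: it assumes
Python's standard unicodedata module, the helper names are made up for
this example, and the NFC normalization it uses to collapse the two
forms is defined by Unicode, not by 10646.

```python
import unicodedata

# The two representations discussed above.
composed = "\u00C0"        # LATIN CAPITAL LETTER A WITH GRAVE
sequence = "\u0041\u0300"  # LATIN CAPITAL LETTER A, COMBINING GRAVE ACCENT


def describe(text):
    """Return the character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]


def same_after_nfc(a, b):
    """Compare two strings after composing each to Unicode form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(len(composed), describe(composed))
    print(len(sequence), describe(sequence))
    print("equal as coded:", composed == sequence)
    print("equal after NFC:", same_after_nfc(composed, sequence))
```

As coded, the two strings have different lengths and compare unequal;
they only compare equal once an application chooses to normalize them.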

> Such an interpretation may not conflict with ISO 10646, but it clearly
> does not help any user. ISO 10646 also does not prohibit collapsing
> these two representations for the benefit of the user.

I would rather say that, for the benefit of the user, you
should encode a character in only one way, and that is the encoding
of 10646. You should not engage in artificial decomposition
of characters; that only complicates things.

Keld

