Re: Internationalized CLASS attributes from Martin J Duerst on 1996-10-17 (www-international@w3.org from October to December 1996)

From: Martin J Duerst <mduerst@ifi.unizh.ch>
Date: Thu, 17 Oct 1996 17:51:42 +0100 (MET)
To: keld@dkuug.dk (Keld Jørn Simonsen)
Cc: rosenne@NetVision.net.il, www-international@w3.org
Message-ID: <"josef.ifi..980:17.09.96.16.51.57"@ifi.unizh.ch>

Keld Simonsen wrote:

>Jonathan Rosenne writes:
>
>> But there is another problem with internationalized names: UCS defines a
>> non-unique coding. Some composite characters have at least two valid
>> representations, the composed character and the base character followed
>> by diacritics. If there is more than one diacritic, their order is not
>> defined. The user often has no control over the coding. So before using
>> a name, it must be brought to a canonical representation.
>
>Well, UCS (=ISO/IEC 10646) does not define ambiguous encoding
>of characters, but Unicode does. Fortunately, HTML is defined in
>terms of ISO/IEC 10646.

ISO 10646 does not define character semantics, and says nothing about
which combinations of codepoints should reasonably be treated as the
same character at the application level and for the user.

Even though ISO 10646 lacks character semantics, combining
characters are clearly defined and mentioned, both because of their
relevance to the implementation levels and, for example,
in Appendix B.

A theoretical interpretation (which Keld seems to be taking) could
say that because ISO 10646 does not say that
	LATIN CAPITAL LETTER A WITH GRAVE
and the sequence of
	LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
are equivalent, and because it calls all three of them graphic characters,
the two things are different, and there is no ambiguity.

Such an interpretation may not conflict with ISO 10646, but it clearly
does not help any user. Nor does ISO 10646 prohibit collapsing
these two representations for the benefit of the user.
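
As a rough illustration of what such collapsing could look like, here is
a minimal sketch. It assumes Python's standard unicodedata module and the
Unicode normalization form NFC, neither of which is part of ISO 10646
itself; the helper name canonical_name is hypothetical and only stands in
for whatever step would canonicalize a name before comparison.

```python
import unicodedata

# Precomposed form: LATIN CAPITAL LETTER A WITH GRAVE
composed = "\u00C0"
# Sequence: LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT
decomposed = "A\u0300"


def canonical_name(name: str) -> str:
    """Hypothetical helper: bring a name to one canonical form (NFC here)."""
    return unicodedata.normalize("NFC", name)


# Compared as raw codepoint sequences, the two spellings differ.
assert composed != decomposed

# After normalization they compare equal.
assert canonical_name(composed) == canonical_name(decomposed)

# Two diacritics of different combining classes, entered in either order,
# also collapse to a single sequence:
# COMBINING DOT BELOW (U+0323) and COMBINING GRAVE ACCENT (U+0300).
order_1 = "a\u0323\u0300"
order_2 = "a\u0300\u0323"
assert order_1 != order_2
assert canonical_name(order_1) == canonical_name(order_2)
```

Whether NFC or some other canonical form is the right choice for names
is a separate question; the sketch only shows that the two
representations can be made to compare equal.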

Regards,	Martin.

Received on Thursday, 17 October 1996 11:52:22 UTC