- From: Martin J Duerst <mduerst@ifi.unizh.ch>
- Date: Thu, 17 Oct 1996 17:51:42 +0100 (MET)
- To: keld@dkuug.dk (Keld J|rn Simonsen)
- Cc: rosenne@NetVision.net.il, www-international@w3.org
Keld Simonsen wrote:

>Jonathan Rosenne writes:
>
>> But there is another problem with internationalized names: UCS defines a
>> non-unique coding. Some composite characters have at least two valid
>> representations, the composed character and the base character followed
>> by diacritics. If there is more than one diacritic, their order is not
>> defined. The user often has no control over the coding. So before using
>> a name, it must be brought to a canonical representation.
>
>Well, UCS (=ISO/IEC 10646) does not define ambiguous encoding
>of characters, but Unicode does. Fortunately, HTML is defined in
>terms of ISO/IEC 10646.

ISO 10646 does not define character semantics, and says nothing
about which combinations of codepoints should reasonably be treated
as the same character at the application level and for the user.

Even though character semantics is missing from ISO 10646, combining
characters are clearly defined and mentioned, both because of their
relevance to implementation levels and, for example, in Appendix B.

A theoretical interpretation (which Keld seems to be taking) could
say that because ISO 10646 does not say that LATIN CAPITAL LETTER A
WITH GRAVE and the sequence of LATIN CAPITAL LETTER A and COMBINING
GRAVE ACCENT are equivalent, and because it calls all three of them
graphic characters, the two things are different, and there is no
ambiguity.

Such an interpretation may not conflict with ISO 10646, but it
clearly does not help any user. ISO 10646 also does not prohibit
collapsing these two representations for the benefit of the user.

Regards, Martin.
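The two representations discussed above, and the idea of bringing a
name to a canonical form before using it, can be sketched as follows.
This is only an illustration, not something taken from the message: it
assumes Python's standard unicodedata module and the Unicode
normalization form NFC, which is one possible choice of canonical
representation. The helper name canonical() is hypothetical.

```python
import unicodedata

# The two codings of the same user-visible character from the message.
composed = "\u00C0"      # LATIN CAPITAL LETTER A WITH GRAVE
decomposed = "A\u0300"   # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT


def canonical(name: str) -> str:
    """Bring a name to one canonical coding (assumption: NFC is used)."""
    return unicodedata.normalize("NFC", name)


# Compared codepoint by codepoint, the two codings differ.
print(composed == decomposed)

# After collapsing both to the same form, they compare equal.
print(canonical(composed) == canonical(decomposed))

# Two diacritics typed in different orders: dot above then dot below,
# versus dot below then dot above. Normalization reorders them.
print(canonical("a\u0307\u0323") == canonical("a\u0323\u0307"))
```

An application that compares or looks up names would apply canonical()
to both sides first, so that the user does not need to control which
coding the input method produced.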
Received on Thursday, 17 October 1996 11:52:22 UTC