- From: Keld Jørn Simonsen <keld@dkuug.dk>
- Date: Thu, 17 Oct 1996 20:23:34 +0200
- To: Martin J Duerst <mduerst@ifi.unizh.ch>
- Cc: rosenne@NetVision.net.il, www-international@w3.org
Martin J Duerst writes:

> Keld Simonsen wrote:
>
> >Jonathan Rosenne writes:
> >
> >> But there is another problem with internationalized names: UCS defines a
> >> non-unique coding. Some composite characters have at least two valid
> >> representations, the composed character and the base character followed
> >> by diacritics. If there is more than one diacritic, their order is not
> >> defined. The user often has no control over the coding. So before using
> >> a name, it must be brought to a canonical representation.
> >
> >Well, UCS (=ISO/IEC 10646) does not define ambiguous encoding
> >of characters, but Unicode does. Fortunately, HTML is defined in
> >terms of ISO/IEC 10646.
>
> ISO 10646 does not define character semantics, and says nothing about
> what combinations of codepoints should reasonably be treated as the
> same characters on the application level and for the user.

Yes, true. This is part of what I meant by saying that 10646 does not
define ambiguous encodings of characters. A character can only be coded
in one way, and this makes things simpler, as it removes the problem of
multiple encodings of the same character, as found in Unicode.

> Even if character semantics is missing from ISO 10646, combining
> characters are clearly defined and mentioned, because of their
> relevance with respect to implementation levels, and for example
> in Appendix B.
>
> A theoretical interpretation (which Keld seems to be taking) could
> say that because ISO 10646 does not say that
> LATIN CAPITAL LETTER A WITH GRAVE
> and the sequence of
> LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT
> are equivalent, and because it calls all three of them graphic characters,
> the two things are different, and there is no ambiguity.

Well, 10646 says that LATIN CAPITAL LETTER A WITH GRAVE and the sequence
of LATIN CAPITAL LETTER A and COMBINING GRAVE ACCENT are not the same
character, as the first is only one character and the latter is two
characters. This is not a theoretical interpretation of 10646, but what
the standard says.

> Such an interpretation may not conflict with ISO 10646, but it clearly
> does not help any user. ISO 10646 also does not prohibit collapsing
> these two representations for the benefit of the user.

I would rather say that for the benefit of the user you should only
encode a character in one way, and that is the encoding of 10646.
You should not engage in artificial decomposition of characters;
that only complicates things.

Keld
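
The two representations discussed in this message can be compared directly. The following is only a minimal sketch: it assumes Python's standard `unicodedata` module and the Unicode normalization forms NFC and NFD, neither of which is mentioned in the thread, and the helper name `canonical` is hypothetical. It shows that the one-character and two-character forms differ as code point sequences, and that a normalization step of the kind Jonathan Rosenne asks for can map both to one representation.

```python
import unicodedata

# The two encodings named in the message.
composed = "\u00C0"     # LATIN CAPITAL LETTER A WITH GRAVE (one code point)
decomposed = "A\u0300"  # LATIN CAPITAL LETTER A + COMBINING GRAVE ACCENT (two)


def canonical(name: str) -> str:
    """Hypothetical helper: bring a name to one representation.

    NFC is an assumption of this sketch; it prefers the precomposed
    character where one exists.
    """
    return unicodedata.normalize("NFC", name)


# As code point sequences the two forms are different strings.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 2

# After normalization they compare equal.
print(canonical(composed) == canonical(decomposed))  # True

# Two diacritics given in either order also normalize to the same string.
# U+0323 is COMBINING DOT BELOW, U+0307 is COMBINING DOT ABOVE.
order_1 = "a\u0323\u0307"
order_2 = "a\u0307\u0323"
print(canonical(order_1) == canonical(order_2))      # True
```

Choosing NFC here matches the preference for the single precomposed character; `unicodedata.normalize("NFD", ...)` would instead map both inputs to the decomposed sequence.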
Received on Thursday, 17 October 1996 14:23:56 UTC