- From: Mark Davis <mark.davis@jtcsv.com>
- Date: Mon, 03 Dec 2001 08:55:31 -0800
- To: ietf-charsets@iana.org
- Cc: markus.scherer@jtcsv.com
I agree with Martin that knowing supersets (and subsets) would be useful information. If there were another format for aliases, that said that some other set is a superset or subset or close to (but not the same as) this set, that would be quite useful. However, in order to avoid corrupting data, any such superset/subset relationship must be actually be true, and true on all platforms that implement both charsets. That seems obvious, but there are pitfalls -- character sets that may seem to be supersets may not, in reality, be such. For example, the cp1252 family as implemented on Windows, will to silently map any "unassigned" character in 0x80..0x9F into U+0080..U+009F, and back again. When a new character was added, the mapping was changed, as for Euro. So the new map is *not* a superset of the old -- one will get distinctly different mappings for the same Unicode character. So the most that can be said about the relationship between windows-windows-1252-1998 and windows-1252-2000 is that they are "close to" one another, *not* that the latter is a subset of the former. To make it even more confusing, other versions of cp1252 may have super/subset relations. For example, java-Cp1252-1.1_P.xml is a full subset of aix-IBM_1252-4.3.6. For more examples, see: http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/windows-1252 -2000.xml (for the windows definition) http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/java-Cp1252- 1.3_P.xml (for the java definition) (the older versions are not posted) For a list of the data we have collected so far on character set mappings, see http://oss.software.ibm.com/icu/charset/index.html. In particular, there is a generated analysis of different sets on http://oss.software.ibm.com/icu/charset/roundtripIndex.html. Note: while the IANA registry is the best that we have, assuming that an IANA ID always mean the same thing will result in data corruption. Take 1252, for example: aix-IBM_1252-4.3.6 is identical to windows-1252-2000, but only if fallback mappings are excluded. java-Cp1252-1.3_P and glibc-CP1252-2.1.2 are 98.05% the same, not identical. Even with the 8859 series there are differences -- search on roundtripIndex.html for 8859_7, for example. When we get to East Asian sets, there are a considerable number of variants. Mark ————— Ὀλίγοι ἔμφονες πολλῶν ἀφρόνων φοβερώτεροι — Πλάτωνος [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr] http://www.macchiato.com ----- Original Message ----- From: "Martin Duerst" <duerst@w3.org> To: "Mark Davis" <mark.davis@us.ibm.com>; "Jonathan Rosenne" <rosenne@qsm.co.il> Cc: <ietf-charsets@iana.org> Sent: Friday, November 30, 2001 22:25 Subject: Re: ISO 8859 -8:1999 > At 18:04 01/11/30 -0800, Mark Davis wrote: > > >Supersets will still cause data corruption problems. See > >http://www.unicode.org/unicode/reports/tr22/, especially Section 1.2.1 > > > >Mark > > There are many cases where knowing a superset relationship would > help. We might propose to IANA that they maintain this information, > or we might just ask upcomming registrations to include that in > their registration information. > > Regards, Martin. >
Received on Tuesday, 4 December 2001 09:36:39 UTC