Re: ISO 8859 -8:1999 from Mark Davis on 2001-12-01 (ietf-charsets@w3.org from October to December 2001)

From: Mark Davis <mark.davis@macchiato.com>
Date: Sat, 01 Dec 2001 05:55:20 -0800
To: Martin Duerst <duerst@w3.org>, Mark Davis <mark.davis@us.ibm.com>, Jonathan Rosenne <rosenne@qsm.co.il>
Cc: ietf-charsets@iana.org
Message-id: <000601c17cd8$6b1b83b0$93d1ea0c@c1340594a>

I agree with Martin that knowing supersets (and subsets) would be useful
information. If there were another format for aliases, that said that some
other set is a superset or subset or close to (but not the same as) this
set, that would be quite useful. However, in order to avoid corrupting data,
any such superset/subset relationship must actually be true, and true on
all platforms that implement both charsets.

That seems obvious, but there are pitfalls -- character sets that may seem
to be supersets may not, in reality, be such. For example, the cp1252 family,
as implemented on Windows, will silently map any "unassigned" character
in 0x80..0x9F into U+0080..U+009F, and back again. When a new character was
added, the mapping was changed, as for the Euro. So the new map is *not* a
superset of the old -- one will get distinctly different mappings for the
same Unicode character. So the most that can be said about the relationship
between windows-1252-1998 and windows-1252-2000 is that they are
"close to" one another, *not* that the latter is a superset of the former.

To make it even more confusing, other versions of cp1252 may have
super/subset relations. For example, java-Cp1252-1.1_P.xml is a full subset
of java-Cp1252-1.3_P.xml.
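
The superset test discussed above can be sketched in Python. This is only
an illustration: the two tables are tiny made-up fragments (byte -> Unicode
code point), not the actual mapping files, and the names OLD_MAP, NEW_MAP,
is_superset and conflicts are hypothetical. The fragments model a table
that passes 0x80 through as U+0080 and a later one that assigns 0x80 to
U+20AC (the Euro sign).

```python
# Illustrative fragments only: byte value -> Unicode code point.
OLD_MAP = {0x41: 0x0041, 0x80: 0x0080}  # 0x80 unassigned, passed through as U+0080
NEW_MAP = {0x41: 0x0041, 0x80: 0x20AC}  # 0x80 reassigned to EURO SIGN


def is_superset(candidate, base):
    """True if every byte -> code point pair in base also appears in candidate."""
    return all(candidate.get(b) == cp for b, cp in base.items())


def conflicts(candidate, base):
    """Bytes that both tables map, but to different code points."""
    return sorted(b for b in base if b in candidate and candidate[b] != base[b])


print(is_superset(NEW_MAP, OLD_MAP))                  # False
print([hex(b) for b in conflicts(NEW_MAP, OLD_MAP)])  # ['0x80']
```

A true superset would report no conflicts; here byte 0x80 converts to two
different characters depending on which table is used.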

For more examples, see:
http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/windows-1252-2000.xml
(for the Windows definition)
http://oss.software.ibm.com/cvs/icu/~checkout~/charset/data/xml/java-Cp1252-1.3_P.xml
(for the Java definition)
(the older versions are not posted)

For a list of the data we have collected so far on character set mappings,
see http://oss.software.ibm.com/icu/charset/index.html. In particular, there
is a generated analysis of different sets on
http://oss.software.ibm.com/icu/charset/roundtripIndex.html.

Note: while the IANA registry is the best that we have, assuming that an
IANA ID always means the same thing will result in data corruption. Take
1252, for example:

aix-IBM_1252-4.3.6 is identical to windows-1252-2000, but only if fallback
mappings are excluded.
java-Cp1252-1.3_P and glibc-CP1252-2.1.2 are 98.05% the same, not identical.
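
As a rough sketch of how such comparisons might be computed (an assumption
for illustration, not necessarily the method behind the figures above),
each table can be treated as a set of (byte, code point) pairs, with
fallback mappings optionally filtered out, and the overlap measured. The
data and the names table_a, table_b and percent_same are hypothetical.

```python
def pairs(table, include_fallbacks):
    """Return the (byte, code point) pairs, optionally dropping fallbacks."""
    return {(b, cp) for (b, cp, fallback) in table if include_fallbacks or not fallback}


def percent_same(a, b, include_fallbacks=False):
    """Shared pairs as a percentage of all pairs found in either table."""
    pa, pb = pairs(a, include_fallbacks), pairs(b, include_fallbacks)
    union = pa | pb
    return 100.0 * len(pa & pb) / len(union) if union else 100.0


# Made-up entries: (byte, code point, is_fallback).
table_a = [(0x41, 0x0041, False), (0x80, 0x20AC, False)]
# table_b adds a one-way fallback: fullwidth A (U+FF21) converts to byte 0x41.
table_b = table_a + [(0x41, 0xFF21, True)]

print(percent_same(table_a, table_b))                                    # 100.0
print(round(percent_same(table_a, table_b, include_fallbacks=True), 2))  # 66.67
```

With fallbacks excluded the two made-up tables are identical; with them
included they are not, which is the kind of distinction that a bare
charset name does not capture.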

Even with the 8859 series there are differences -- search on
roundtripIndex.html for 8859_7, for example. When we get to East Asian sets,
there are a considerable number of variants.

Mark
—————

Ὀλίγοι ἔμφρονες πολλῶν ἀφρόνων φοβερώτεροι — Πλάτωνος
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com
----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark.davis@us.ibm.com>; "Jonathan Rosenne"
<rosenne@qsm.co.il>
Cc: <ietf-charsets@iana.org>
Sent: Friday, November 30, 2001 22:25
Subject: Re: ISO 8859 -8:1999

> At 18:04 01/11/30 -0800, Mark Davis wrote:
>
> >Supersets will still cause data corruption problems. See
> >http://www.unicode.org/unicode/reports/tr22/, especially Section 1.2.1
> >
> >Mark
>
> There are many cases where knowing a superset relationship would
> help. We might propose to IANA that they maintain this information,
> or we might just ask upcoming registrations to include that in
> their registration information.
>
> Regards,    Martin.
>

Received on Thursday, 20 December 2001 09:47:18 UTC