Re: Definition of charset "macintosh"

Whenever a character in a charset changes, that can cause data to be
corrupted. This is especially important nowadays, when the internal
character set is Unicode/10646 and XML (or HTML) is used to serialize the
text in a different character set. Here is what happens.

1. An implementation with the old definition emitting data marked as
"Macintosh" will escape a Euro sign (€) as &#x20AC; while leaving the
currency sign (¤) alone. An implementation with the new definition receiving
that data will correctly handle the Euro, but misinterpret the currency sign
as a Euro.

- While the currency sign is little used (and was badly conceived in the
first place), it is used. For example, in both Windows and Java it serves
as a placeholder for the actual currency symbol in a currency pattern
string. Changing that to Euro would cause even apparently unrelated currency
values such as dollars to appear as Euros.

2. An implementation with the new definition emitting data marked as
"Macintosh" will escape a currency sign (¤) as &#xA4; while leaving the Euro
sign (€) alone. An implementation with the old definition receiving that data
will correctly handle the currency sign, but misinterpret the Euro as a
currency sign.

- This is even more serious. All new data with Euros will be misinterpreted
on older implementations.
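
The two cases above can be reproduced with a minimal Python sketch. It is
only an illustration under stated assumptions: the two definitions are
reduced to their single differing byte (0xDB), the text is limited to ASCII
plus the two signs, escapes are written as hexadecimal character references,
and the names serialize, parse, OLD_AT_DB and NEW_AT_DB are hypothetical.

```python
import re

OLD_AT_DB = "\u00A4"  # CURRENCY SIGN: old definition of byte 0xDB
NEW_AT_DB = "\u20AC"  # EURO SIGN: new definition of byte 0xDB

def serialize(text, char_at_db):
    """Emit ASCII as-is, the character mapped to 0xDB as that byte,
    and anything else as a hexadecimal character reference."""
    out = bytearray()
    for ch in text:
        if ch == char_at_db:
            out.append(0xDB)
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            out += "&#x{:X};".format(ord(ch)).encode("ascii")
    return bytes(out)

def parse(data, char_at_db):
    """Decode bytes with the receiver's definition, then expand references."""
    text = "".join(char_at_db if b == 0xDB else chr(b) for b in data)
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

sample = "\u20AC \u00A4"  # Euro sign, space, currency sign

# Case 1: old sender, new receiver. The currency sign arrives as a Euro.
old_to_new = parse(serialize(sample, OLD_AT_DB), NEW_AT_DB)

# Case 2: new sender, old receiver. The Euro arrives as a currency sign.
new_to_old = parse(serialize(sample, NEW_AT_DB), OLD_AT_DB)
```

In both directions the escaped character survives and the one sent as the
raw byte 0xDB is the one that gets reinterpreted.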

To sum it up, changing any character in a set can be dangerous. The best way
to avoid these situations is:

A. Define fully qualified names for all versions of character sets. The TR22
naming conventions are strongly recommended
(http://www.unicode.org/unicode/reports/tr22/). Encourage implementations to
use the fully-qualified names.

B. One can also have a partially-qualified name (e.g. "Macintosh") as an
alias for one of these. And that alias could change over time to be the
latest version. Implementers can also use the partially-qualified character
set names in circumstances where robust data conversion is not as important.
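
As a rough sketch of A and B together, a lookup table might keep one entry
per version and treat the short name as an alias. The version names below
are made up for illustration; they are neither registered charset names nor
TR22-conformant names, and only byte 0xDB is modeled.

```python
# Hypothetical fully qualified names, one per version of the character set.
CHARSETS = {
    "macintosh-pre-euro": {0xDB: "\u00A4"},  # 0xDB is CURRENCY SIGN
    "macintosh-euro": {0xDB: "\u20AC"},      # 0xDB is EURO SIGN
}

# The partially qualified name is only an alias; it can be repointed to a
# newer version later without touching the versioned entries.
ALIASES = {"macintosh": "macintosh-euro"}

def resolve(name):
    """Map a charset label to a fully qualified name (case-insensitive)."""
    key = name.lower()
    return ALIASES.get(key, key)

def char_at_db(name):
    """Return the character that the named charset assigns to byte 0xDB."""
    return CHARSETS[resolve(name)][0xDB]
```

Data labeled with a fully qualified name keeps its meaning even after the
alias moves; data labeled only "Macintosh" follows whatever the alias
currently points to.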

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Harald Tveit Alvestrand" <harald@alvestrand.no>
To: "Deborah Goldsmith" <goldsmit@apple.com>; "IETF Charsets Mailing List"
<ietf-charsets@iana.org>
Sent: Monday, January 14, 2002 06:40
Subject: Re: Definition of charset "macintosh"


> nobody seems to have commented on this....
>
> if "macintosh" is used in the industry to refer to a charset that has the
> euro sign in it, then I, personally, think that we should update the
> registration to point out that fact.
>
> In a more rational world, a new "macintosh-euro" charset would be
> registered, but the currency symbol is the single most useless character I
> know about - redefining its codepoint does not cause a great deal of harm
> to the world.
>
> What do others think?
>
>             Harald
>
> --On 14. desember 2001 11:17 -0800 Deborah Goldsmith <goldsmit@apple.com>
> wrote:
>
> > The IANA registration for the charset "macintosh", which represents the
> > Mac OS Roman character set, currently refers to RFC 1345.
> >
> > Since RFC 1345 was published, the definition of the MacRoman character
> > set has changed. In particular, the code point 0xDB, which was formerly
> > U+00A4 CURRENCY SIGN, was redefined to be U+20AC EURO SIGN.
> >
> > What would be the appropriate course of action to deal with this
> > discrepancy? Registering a new "macintosh-euro" character set seems like
> > overkill. Apple would prefer to just redefine the IANA-registered
> > character set "macintosh" to conform to the new definition of MacRoman.
> > Is that allowed? If so, what procedure should be followed?
> >
> > The definition of MacRoman can be found at:
> >
> > http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT
> >
> > Would it be appropriate to refer to that rather than to a (revised) RFC?
> >
> > Thanks,
> >
> > Deborah Goldsmith
> > Manager, Fonts & Language Kits
> > Apple Computer, Inc.
> > goldsmith@apple.com
> >
> >
> >
>
>
>

Received on Monday, 14 January 2002 11:59:14 UTC