Re: For review: Migrating to Unicode from Frank Ellermann on 2008-03-24 (www-international@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Mon, 24 Mar 2008 19:01:28 +0100
To: www-international@w3.org
Message-ID: <fs8q51$sis$1@ger.gmane.org>

Erik van der Poel wrote:

> Instead of looking at ECMA and ISO standards, another
> way to look at this is to note that we are having this
> discussion on a mailing list @w3.org.

Neutral territory unless we try to screw with Charmod ;-)

> one of the main standards that uses "Content-Type" is
> HTTP, i.e. RFC 2616.

Yes, the IETF http-bis WG has its mailing list hosted by
the W3C, and 2616bis will link directly to the IANA registry:

> http://www.iana.org/assignments/character-sets

> For iso-8859-1, this refers to RFC 1345, which has the
> following data for iso-8859-1

RFC 1345 is old, allegedly buggy, and only informational.
Martin is one of the two IANA charset registry experts, so we
could fix "it" (RFC 1345 and/or the registry) if necessary.

>   &charset ISO_8859-1:1987
>   &rem source: ECMA registry
>   &alias iso-ir-100
>   &g1esc x2d41 &g2esc x2e41 &g3esc x2f41
[...]
[...mnemonics from 0x00 up to 0xFF for C0, G0, C1, G1...]

> So, in theory, HTTP user agents should use the above C0
> and C1 sets, along with the indicated G1, G2 and G3.

I think the indicated escapes are used to invoke the
"right hand part" of Latin-1 as either G1, G2, or G3,
compare <http://www.itscj.ipsj.or.jp/ISO-IR/2-3.htm>.
For an example see RFC 2157:

| For ISO 8859-1, the relevant escape sequence will be:
|  ESC 28 42
|        ASCII in G0
|  ESC 2D 41
|        ISO-IR-100 in G1
|  ESC 21 41
|        High control character set in C1
|  ESC 7E
|        Locking shift 1 Right

That binds 94 characters from ISO-IR 6 (left hand part) as G0
and 96 characters from ISO-IR 100 (right hand part) as G1.  The
ESC 21 41 could be a typo, and should perhaps be ESC 22 43 for
ISO-IR 77 (?)  I'll ask Harald if that is an erratum.
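
To make the quoted bytes concrete, here is a minimal Python sketch
that spells out the four escape sequences exactly as RFC 2157 gives
them above, plus the three designations from the RFC 1345 entry.
The names (ESC, RFC2157_SEQUENCES, RFC1345_DESIGNATIONS, announce)
are made up for illustration.  It only builds the byte string; it
does not interpret ISO 2022, and it keeps the possibly wrong
ESC 21 41 as quoted.

```python
ESC = 0x1B

# Bytes after ESC, with the meaning given in the RFC 2157 quote above.
RFC2157_SEQUENCES = [
    (bytes([0x28, 0x42]), "ASCII in G0"),
    (bytes([0x2D, 0x41]), "ISO-IR-100 in G1"),
    (bytes([0x21, 0x41]), "High control character set in C1"),  # as quoted, maybe a typo
    (bytes([0x7E]), "Locking shift 1 Right"),
]

# &g1esc / &g2esc / &g3esc values from the RFC 1345 entry quoted earlier.
RFC1345_DESIGNATIONS = {
    "G1": bytes([0x2D, 0x41]),
    "G2": bytes([0x2E, 0x41]),
    "G3": bytes([0x2F, 0x41]),
}


def announce(sequences=RFC2157_SEQUENCES) -> bytes:
    """Concatenate ESC plus each tail into one byte string."""
    return b"".join(bytes([ESC]) + tail for tail, _meaning in sequences)


if __name__ == "__main__":
    for tail, meaning in RFC2157_SEQUENCES:
        hex_bytes = " ".join(f"{b:02X}" for b in tail)
        print(f"ESC {hex_bytes}  {meaning}")
    print(announce())
```

The G1 designation is the same two bytes in both sources, which is
the only point the sketch is meant to show.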

Apparently RFC 2157 does what you and John said for RFC 822
to mumble gateways (I know nothing about Mixer and X.400),
but like RFC 1345 it binds the C1 set for iso-8859-1.

> And C1 is ignored completely, using windows-1252 instead.

> There is no point living in ECMA/ISO/RFC theory land. I
> prefer to live in the real world, and test the popular
> applications to see what they actually do.

I think specifications need to be correct, and if I don't
like them I shouldn't pretend to follow them, e.g., all my
Web pages are us-ascii, windows-1252, or utf-8, not a single
iso-8859-1 page.  

I'm pessimistic about using windows-1252 while claiming that
it is iso-8859-1 in, say, certificates or similar applications,
popular or otherwise.  D****d Latin-1 default in RFC 2616 :-(
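
A small sketch of why the label matters, assuming Python 3 and its
built-in "iso-8859-1" and "windows-1252" codecs (SAMPLE, decode_both
and differing_bytes are hypothetical names).  The two charsets agree
on 0xA0..0xFF and differ in 0x80..0x9F, where iso-8859-1 has C1
controls and windows-1252 has mostly printable characters.

```python
# 0x93/0x94 are curly quotes and 0x80 is the euro sign in windows-1252;
# in iso-8859-1 the same bytes are C1 control characters.
SAMPLE = b"\x93caf\xe9\x94 \x80 5"


def decode_both(data: bytes):
    """Return the data decoded as (iso-8859-1, windows-1252)."""
    return data.decode("iso-8859-1"), data.decode("windows-1252")


def differing_bytes(data: bytes):
    """List the byte values that decode differently in the two charsets."""
    latin1, cp1252 = decode_both(data)
    # Both are single-byte charsets, so characters line up with bytes.
    return [b for b, x, y in zip(data, latin1, cp1252) if x != y]


if __name__ == "__main__":
    latin1, cp1252 = decode_both(SAMPLE)
    print(repr(latin1))
    print(repr(cp1252))
    print([hex(b) for b in differing_bytes(SAMPLE)])
```

A receiver that takes the iso-8859-1 label literally gets control
characters where the sender meant quotes and a euro sign.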

 Frank

Received on Monday, 24 March 2008 17:59:45 UTC