Re: For review: Migrating to Unicode

>  >> | Windows-1252, an extension of ISO-8859-1
>
>  >> Is "extension" strictly correct ?  Or is it only a "variation" ?
>
>  > Extension is strictly correct.  ISO 8859-1 does not assign meaning
>  > to the bytes 0x80-0x9F (the overall framework may assign them
>  > meaning as control characters), but Windows-1252 does.
>
>  That's odd, isn't it ?  When I use iso-8859-1 as Content-Type

Strictly speaking, iso-8859-1 is not the Content-Type. It is the value
of the charset parameter, which is part of the Content-Type header.
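
A minimal sketch of that distinction, assuming Python and its standard
email package (the helper name split_content_type is made up for this
illustration): the header value carries a media type plus parameters,
and charset is only one of those parameters.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (media_type, charset) for a Content-Type header value.

    Hypothetical helper for illustration; charset is None when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = split_content_type("text/html; charset=ISO-8859-1")
print(media_type)  # the Content-Type proper
print(charset)     # the charset parameter, lower-cased by the parser
```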

>  I certainly want more than only the minimal C0 set with ESC.  I'm
>  going to use CR and LF (and maybe HT, FF, and others) without
>  explicitly invoking a "non-minimal" C0 set.

I don't believe there are very many programs that process
"Content-Type" and do meaningful things with C0 characters other than
CR, LF and HT once they have decided that they are dealing with
iso-8859-1. The only other C0 character that some programs "process"
is NUL; MSIE simply deletes it (!).

>  I also assume that
>  it's ECMA 43 level 1 without SS2 and SS3, let alone any level 3
>  locking shifts.  I never tried to invoke a G2 or G3 within a
>  document claiming to be iso-8859-1.

Again, the programs mentioned above don't deal with SS2, SS3, G2 and
G3 when they have decided that the charset is iso-8859-1. There may be
a few programs that bother with these things, but there probably isn't
much data (content, email, etc) out there that tries to take advantage
of these features of ISO 2022 inside iso-8859-1 data.

In other charsets, some of these ISO 2022 features are used, of
course. Some of them have additional rules, above and beyond those
specified in ISO 2022. One example of a charset based on ISO 2022 is
iso-2022-jp. An early English description of this appears in RFC 1468.
Another example is euc-jp, which uses SS2 and SS3. Yet more examples
are iso-2022-kr, iso-2022-cn and euc-tw.
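
To make the ISO 2022 machinery visible, here is a small sketch that
assumes Python's built-in iso2022_jp and euc_jp codecs. It only looks
at the bytes those codecs produce: the escape sequences that switch
character sets in iso-2022-jp, and the SS2 byte (0x8E) that euc-jp puts
in front of a half-width katakana. The sample strings are arbitrary.

```python
ESC = b"\x1b"
SS2 = 0x8E  # single shift 2 in euc-jp; SS3 is 0x8F

# iso-2022-jp is a 7-bit encoding: escape sequences designate JIS X 0208
# before the kanji and switch back to ASCII at the end.
jis = "\u65e5\u672c\u8a9e".encode("iso2022_jp")  # "Japanese language" in kanji
starts_with_designation = jis.startswith(ESC + b"$B")
ends_in_ascii = jis.endswith(ESC + b"(B")
is_seven_bit = all(byte < 0x80 for byte in jis)

# euc-jp is an 8-bit encoding: a half-width katakana (here U+FF71) is
# written as SS2 followed by its JIS X 0201 byte.
euc = "\uff71".encode("euc_jp")
uses_ss2 = euc[0] == SS2

print(jis, starts_with_designation, ends_in_ascii, is_seven_bit)
print(euc, uses_ss2)
```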

Coming back to iso-8859-1, the byte values 0x80 to 0x9F are typically
treated as though the charset were windows-1252. I believe many
programs that process "Content-Type" will treat us-ascii, iso-8859-1,
windows-1252 and all of their aliases as windows-1252, i.e. the
"superset".
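
The effect can be sketched in Python, assuming its built-in latin-1 and
cp1252 codecs. The function effective_codec is a hypothetical stand-in
for what such a program does with the label; it is not taken from any
particular browser or mail client.

```python
import codecs

# Canonical codec names for the three labels that get widened.
_WIDENED = {codecs.lookup(name).name
            for name in ("us-ascii", "iso-8859-1", "windows-1252")}


def effective_codec(label):
    """Map a declared charset label to the codec actually used.

    Hypothetical policy: us-ascii, iso-8859-1, windows-1252 and their
    aliases all decode as windows-1252; other labels are left alone.
    """
    name = codecs.lookup(label).name
    return "cp1252" if name in _WIDENED else name


data = b"\x93quoted\x94"  # 0x93 and 0x94 are curly quotes in windows-1252

strict = data.decode("latin-1")                       # C1 controls U+0093, U+0094
lenient = data.decode(effective_codec("iso-8859-1"))  # U+201C ... U+201D

print(repr(strict))
print(repr(lenient))
```

Note that Python's cp1252 codec rejects the five byte values that
windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), so a
real program following this policy would also need to decide what to
do with those.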

Another family of charsets with a superset relationship is GB2312, GBK
and GB18030.
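
A quick way to see that relationship, assuming Python's built-in
gb2312, gbk and gb18030 codecs: text that GB2312 can encode yields the
same bytes in all three, while each larger charset accepts characters
the smaller one rejects. The sample characters and the helper name
encodable are arbitrary choices for this sketch.

```python
CODECS = ("gb2312", "gbk", "gb18030")


def encodable(text, codec):
    """Return True if text can be encoded in codec (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "\u4e2d\u6587"  # two common simplified characters
traditional = "\u9f8d"       # traditional "dragon", absent from GB2312
emoji = "\U0001F600"         # outside the BMP, needs GB18030's 4-byte form

# Same bytes in all three charsets for text that GB2312 covers.
same_bytes = len({simplified.encode(c) for c in CODECS}) == 1

# Which codec accepts which sample.
coverage = {c: (encodable(traditional, c), encodable(emoji, c)) for c in CODECS}

print(same_bytes)
print(coverage)
```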

Erik

Received on Saturday, 22 March 2008 14:11:27 UTC