- From: Erik van der Poel <erikv@google.com>
- Date: Sat, 22 Mar 2008 07:10:49 -0700
- To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
- Cc: www-international@w3.org
> >> | Windows-1252, an extension of ISO-8859-1
> >
> >> Is "extension" strictly correct ? Or is it only a "variation" ?
> >
> > Extension is strictly correct. ISO 8859-1 does not assign meaning
> > to the bytes 0x80-0x9F (the overall framework may assign them
> > meaning as control characters), but Windows-1252 does.
>
> That's odd, isn't it ? When I use iso-8859-1 as Content-Type

Strictly speaking, iso-8859-1 is not the Content-Type. It is the charset, which is part of the Content-Type header.

> I certainly want more than only the minimal C0 set with ESC. I'm
> going to use CR and LF (and maybe HT, FF, and others) without
> explicitly invoking a "non-minimal" C0 set.

I don't believe there are very many programs that process "Content-Type" and do meaningful things with C0 characters other than CR, LF and HT once they have decided that they are dealing with iso-8859-1. The only other C0 character that is "processed" by some programs is NUL. MSIE simply deletes this(!).

> I also assume that
> it's ECMA 43 level 1 without SS2 and SS3, let alone any level 3
> locking shifts. I never tried to invoke a G2 or G3 within a
> document claiming to be iso-8859-1.

Again, the programs mentioned above don't deal with SS2, SS3, G2 and G3 when they have decided that the charset is iso-8859-1. There may be a few programs that bother with these things, but there probably isn't much data (content, email, etc) out there that tries to take advantage of these features of ISO 2022 inside iso-8859-1 data.

In other charsets, some of these ISO 2022 features are used, of course. Some of them have additional rules, above and beyond those specified in ISO 2022. One example of a charset based on ISO 2022 is iso-2022-jp. An early English description of this appears in RFC 1468. Another example is euc-jp, which uses SS2 and SS3. Yet more examples are iso-2022-kr, iso-2022-cn and euc-tw.
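
To make the ISO 2022 mechanisms mentioned above a little more concrete, here is a minimal sketch using Python's built-in codecs. Python, the sample strings and the variable names are assumptions of this illustration; none of them come from the discussion itself. It shows only that iso-2022-jp switches character sets with escape sequences while staying 7-bit, and that euc-jp prefixes half-width katakana with the SS2 byte 0x8E.

```python
# Sketch only: Python's codecs and these sample strings are assumptions
# made for illustration; they are not part of the original discussion.

ESC = b"\x1b"
SS2 = 0x8E  # single shift 2, as used by euc-jp

text = "日本語"

# iso-2022-jp stays 7-bit and switches character sets with escape sequences.
iso2022 = text.encode("iso2022_jp")
starts_with_jis_escape = iso2022.startswith(ESC + b"$B")  # designate JIS X 0208
ends_in_ascii = iso2022.endswith(ESC + b"(B")             # return to ASCII
is_seven_bit = all(b < 0x80 for b in iso2022)

# euc-jp is 8-bit; a half-width katakana character is introduced by SS2.
halfwidth_a = "\uff71"  # HALFWIDTH KATAKANA LETTER A
euc = halfwidth_a.encode("euc_jp")
uses_ss2 = euc[0] == SS2

print(iso2022)
print(euc)
```

In euc-jp, SS3 (0x8F) plays the analogous role for JIS X 0212 characters.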
Coming back to iso-8859-1, the byte values 0x80 to 0x9F are typically treated as though the charset were windows-1252. I believe many programs that process "Content-Type" will treat us-ascii, iso-8859-1, windows-1252 and all of their aliases as windows-1252, i.e. the "superset".

Another family of charsets with a superset relationship is GB2312, GBK and GB18030.

Erik
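
As a rough illustration of the "superset" treatment described in this message, the sketch below maps a declared charset label to the codec a lenient program might actually use. The label table and the function names (effective_codec, decode_as_declared) are hypothetical and chosen for this example only; real programs recognise many more aliases.

```python
# Sketch only: the label table and helper names are hypothetical.
WINDOWS_1252_FAMILY = {
    "us-ascii", "ascii", "iso-8859-1", "latin1", "windows-1252", "cp1252",
}


def effective_codec(label: str) -> str:
    """Return the Python codec name a lenient program might use for a label."""
    name = label.strip().lower()
    return "cp1252" if name in WINDOWS_1252_FAMILY else name


def decode_as_declared(data: bytes, label: str) -> str:
    """Decode bytes using the superset codec chosen for the declared label."""
    return data.decode(effective_codec(label))


raw = b"\x93quoted\x94"

# Decoded strictly as iso-8859-1, 0x93 and 0x94 become C1 control characters.
strict = raw.decode("iso-8859-1")

# Decoded as the windows-1252 superset, they become curly quotation marks.
lenient = decode_as_declared(raw, "ISO-8859-1")

# The GB family behaves similarly: GB2312 bytes decode identically
# under the GBK and GB18030 codecs.
gb_bytes = "中文".encode("gb2312")
gb_superset_ok = gb_bytes.decode("gbk") == gb_bytes.decode("gb18030") == "中文"

print(repr(strict), repr(lenient), gb_superset_ok)
```

Note that Python's own cp1252 codec leaves five bytes in the 0x80-0x9F range (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and raises an error on them, so a program wanting fully lenient behaviour would need to handle those separately.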
Received on Saturday, 22 March 2008 14:11:27 UTC