- From: Frank Ellermann <nobody@xyzzy.claranet.de>
- Date: Sun, 23 Mar 2008 21:26:10 +0100
- To: www-international@w3.org
Hi John and Erik,

> Strictly speaking, iso-8859-1 is not the Content-Type. It is the
> charset, which is part of the Content-Type header.

ACK, that's what I should have said, wondering about a rule which C0 or even C1 set is used for charset=iso-8859-1.

I read the ECMA versions hoping that they are close enough to the ISO standards. ECMA 94 covers Latin-1 .. 4, related to the old (1986) corresponding iso-8859-1 .. 4 standards. For Latin-1 it defines the usual 96 G1 characters, the fixed SP + DEL, and the usual 94 US-ASCII G0 characters. Otherwise it says that it's typically used together in an ECMA 35 (ISO 2022) or 43 (ISO 4873) framework.

Ignoring ECMA 35 as a hopeless case for figuring out what the C0 and C1 sets for iso-8859-1 are supposed to be, I looked into ECMA 43. That defines that you minimally need ESC, e.g. ISO-IR 104, to get the ESC-magic (including the ESC @ .. ESC _ 7bit variants of C1 controls, notably CSI, as far as there are any C1 controls "in" Latin-1 - that would be important for Richard's article, where he talks about the "presence of ESC" in charset identifications).

ECMA 43 also requires that control characters defined in ECMA 48 (ISO 6429) for C0 must not be used in any C1 set, and from there I can trust that any CR and LF in Latin-1 documents use the octets specified in ISO-IR 1. Somehow I'm unable to find a requirement that iso-8859 in fact uses ISO-IR 1, and not only ISO-IR 104, or some other C0 set with ESC where it should be:

http://www.itscj.ipsj.or.jp/ISO-IR/2-5.htm

ECMA 43 guarantees that 0x0E and 0x0F are not supposed to do odd things, they are unused. That's fine, I don't care about SI and SO wrt iso-8859. I want to know why CR, LF, HT, FF, and a few other C0 controls are what I want, when as John said C1 controls are not bound to some known set for iso-8859-1 (for his argument that windows-1252 is an "extension" and not only a "variation").
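
As a rough illustration of the code layout discussed above, here is a minimal Python sketch that sorts an octet into the C0, SP, G0, DEL, C1 and G1 areas of an 8-bit code as described for ECMA 43 and ECMA 94, and builds the 7-bit ESC @ .. ESC _ form of a C1 control. The function names classify_octet and c1_to_7bit are hypothetical, chosen only for this example, and the sketch says nothing about which C0 or C1 set is actually designated.

```python
ESC = 0x1B


def classify_octet(b: int) -> str:
    """Return the code-table area of an octet in an 8-bit ECMA 43 style code."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not an octet")
    if b <= 0x1F:
        return "C0"   # control set, e.g. ISO-IR 1
    if b == 0x20:
        return "SP"   # fixed SPACE
    if b <= 0x7E:
        return "G0"   # 94 graphic characters, e.g. ISO-IR 6 (US-ASCII)
    if b == 0x7F:
        return "DEL"  # fixed DELETE
    if b <= 0x9F:
        return "C1"   # second control set, unspecified for iso-8859-1
    return "G1"       # 96 graphic characters in Latin-1


def c1_to_7bit(b: int) -> bytes:
    """Return the 7-bit form of a C1 control: ESC plus the octet minus 0x40."""
    if not 0x80 <= b <= 0x9F:
        raise ValueError("not a C1 control")
    return bytes([ESC, b - 0x40])
```

For example, c1_to_7bit(0x9B) gives ESC [ (the 7-bit CSI), and c1_to_7bit(0x85) gives ESC E (NEL).
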
Naively, if the C1 controls are not bound, the C0 controls also are not bound, and I have only ESC assuming an ECMA 43 framework.

For Unicode it's clear, it follows ECMA 48, giving us the normal C0 controls including ESC, CR, LF, but also the normal C1 set with among others NEL 0x85, and removing the former IND 0x84 - the latter was fixed in Unicode some years ago, and I hope it will also be removed in the net-utf8 RFC before this gets its number. For iso-8859 it's unclear, or I'm missing a clue better than ECMA 43.

> The only other C0 character that is "processed" by some
> programs is NUL. MSIE simply deletes this(!).

Legacy documents could do something with "ANSI.SYS" control sequences; trying to set colours can make sense in a static document. That's where finding ESC might not be good enough to guess that it's ISO 2022 as proposed in Richard's article. Such tricks would typically be used with legacy charsets cp437, cp850, cp858, or similar, not windows-1252 or iso-8859-1, but they are all identical for documents limited to US-ASCII, and Richard's ESC magic could fail if it's only ESC [, i.e. CSI.

> In other charsets, some of these ISO 2022 features are used,
> of course.

Sure, but I assumed a simpler ECMA 43 (ISO 4873) framework for my points. Richard's proposal to look for SI and SO to catch popular ISO 2022 charsets is fine. ESC is also okay, but then he had better exclude 7bit CSI, or maybe all 7bit variants of C1.

John wrote:

| The C0 character set of ISO 646 is so entrenched that it does
| not even have its own unique name, and even nigglers like me
| wind up speaking as if US-ASCII had 128 rather than 95
| characters

Some weeks ago I didn't know that there is a zoo of C0 sets, and that ISO-IR 1 is what I meant when I said C0 before, with ISO-IR 6 to get the 94 visible US-ASCII characters. I'd use RFC 20 as normative reference, that takes care of all issues with (today) obscure variants.

| Synchronous Idle, anyone?

No, thanks, I'm not going to read what the USR Courier manual says about software flow control again in this millennium. But I recall when ACK, NAK, DLE, SYN, etc. used to be relevant.

| otherwise we'd be typing ^W and not ^D or ^Z to signal an end
| of file from the keyboard

Some IBM charsets rotated SUB-FS-DEL, resulting in 0x1A (^Z) FS, 0x1C (^\) DEL, and 0x7F (^?) SUB. That oddity is still relevant for ICU and CharmapML, SUB ending up at 0x7F instead of 0x1A. IBM didn't go so far as to implement it in PC DOS, and AFAIK nobody else did.

IIRC ^Z is EOF, because SUB was used to fill the last sector of CP/M files, a kind of mandatory padding, degenerated into one ^Z for MS and PC DOS text files, with some versions of COMMAND.COM refusing to interpret the last line of BAT files if there was no EOF. One tool I still use today offers to define and do interesting things with EOFIN and EOFOUT, with defaults suited for old DOS BAT files.

Unix ^D for EOT is a slightly different story, unfortunately not used on DOS + OS/2 + NT platforms, where ^C or ^Z might work.

Frank
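
A minimal Python sketch of the check discussed above for Richard's heuristic: treat SO or SI as evidence of ISO 2022, and treat ESC as evidence only when it is not followed by an octet in the range 0x40 .. 0x5F, so that 7-bit C1 controls such as ESC [ (CSI, as in ANSI.SYS colour sequences) are ignored. The function name looks_like_iso2022 is hypothetical, and this is only one possible reading of the idea, not the method from the article.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F


def looks_like_iso2022(data: bytes) -> bool:
    """Guess whether data contains ISO 2022 shifts or designation escapes."""
    # Locking shifts are taken as direct evidence.
    if SO in data or SI in data:
        return True
    i = data.find(ESC)
    while i != -1:
        nxt = data[i + 1] if i + 1 < len(data) else None
        # ESC followed by 0x40..0x5F is the 7-bit form of a C1 control
        # (ESC [ is CSI), so it is not counted; a trailing ESC is ignored.
        if nxt is not None and not 0x40 <= nxt <= 0x5F:
            return True
        i = data.find(ESC, i + 1)
    return False
```

With this sketch, an ISO-2022-JP style designation such as ESC $ B counts as evidence, while a colour sequence such as ESC [ 3 1 m does not. Excluding all 7-bit C1 forms also skips ESC N and ESC O (SS2 and SS3), which some ISO 2022 encodings use, so a stricter variant might exclude only ESC [.
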
Received on Sunday, 23 March 2008 20:24:32 UTC