Re: For review: Migrating to Unicode from Frank Ellermann on 2008-03-23 (www-international@w3.org from January to March 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Sun, 23 Mar 2008 21:26:10 +0100
To: www-international@w3.org
Message-ID: <fs6e8a$t8p$1@ger.gmane.org>
Hi John and Erik,
 
> Strictly speaking, iso-8859-1 is not the Content-Type. It is the
> charset, which is part of the Content-Type header.

ACK, that's what I should have said.  I was wondering which rule
determines the C0 or even C1 set used for charset=iso-8859-1.  I
read the ECMA versions, hoping that they are close enough to the
ISO standards.

ECMA 94 covers Latin-1 .. 4, corresponding to the old (1986)
iso-8859-1 .. 4 standards.  For Latin-1 it defines the usual 96
G1 characters, the fixed SP + DEL, and the usual 94 US-ASCII G0
characters.  Otherwise it only says that it's typically used
within an ECMA 35 (ISO 2022) or ECMA 43 (ISO 4873) framework.

Ignoring ECMA 35 as a hopeless case for figuring out what the C0
and C1 sets for iso-8859-1 are supposed to be, I looked into
ECMA 43.

That defines that you minimally need ESC, e.g. ISO-IR 104, to get
the ESC-magic (including the ESC @ .. ESC _ 7bit variants of C1
controls, notably CSI, as far as there are any C1 controls "in"
Latin-1 - that would be important for Richard's article, where
he talks about the "presence of ESC" in charset identifications).
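
To make the 7bit variants concrete: in the ECMA 35 / ECMA 48 scheme,
ESC followed by an octet in 0x40 .. 0x5F stands for the C1 control
0x40 higher, so ESC [ is CSI (0x9B) and ESC E is NEL (0x85).  A
minimal Python sketch of that mapping; the function names are mine
and purely illustrative:

```python
ESC = 0x1B

def c1_from_escape(final_octet: int) -> int:
    """Return the 8-bit C1 code represented by ESC <final_octet>."""
    if not 0x40 <= final_octet <= 0x5F:
        raise ValueError("ESC + this octet is not a 7bit C1 variant")
    return final_octet + 0x40

def escape_from_c1(c1: int) -> bytes:
    """Return the 7bit two-octet form (ESC + octet) of a C1 control."""
    if not 0x80 <= c1 <= 0x9F:
        raise ValueError("not a C1 control")
    return bytes([ESC, c1 - 0x40])

# ESC [ is the 7bit form of CSI, ESC E the 7bit form of NEL.
print(hex(c1_from_escape(ord("["))))
print(escape_from_c1(0x85))
```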

ECMA 43 also requires that control characters defined in ECMA 48
(ISO 6429) for C0 must not be used in any C1 set, and from there
I can trust that any CR and LF in Latin-1 documents use the octets
specified in ISO-IR 1.  Somehow I'm unable to find a requirement
that iso-8859 in fact uses ISO-IR 1, and not only ISO-IR 104 or
some other C0 set with ESC where it should be:

http://www.itscj.ipsj.or.jp/ISO-IR/2-5.htm

ECMA 43 guarantees that 0x0E and 0x0F are not supposed to do odd
things; they are unused.  That's fine, I don't care about SI and
SO wrt iso-8859, I want to know why CR, LF, HT, FF, and a few
other C0 controls are what I want, when as John said C1 controls
are not bound to some known set for iso-8859-1 (for his argument
that windows-1252 is an "extension" and not only a "variation").

Naively, if the C1 controls are not bound, the C0 controls also
are not bound, and I have only ESC assuming an ECMA 43 framework.

For Unicode it's clear, it follows ECMA 48, giving us the normal
C0 controls including ESC, CR, LF, but also the normal C1 set
with among others NEL 0x85, and removing the former IND 0x84 -
the latter was fixed in Unicode some years ago, and I hope it will
also be removed in the net-utf8 RFC before this gets its number.
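
As an illustration of the C1 range in practice, here is what one
implementation does with octet 0x85.  Python's iso-8859-1 codec
simply maps every octet to the code point of the same value; that is
a common convention, not evidence of what ISO 8859-1 itself mandates.
A small sketch using only the standard library:

```python
import unicodedata

octet = b"\x85"

# Python's iso-8859-1 codec maps each octet to the code point of the
# same value, so 0x85 lands on U+0085 in Unicode's C1 range.
as_latin1 = octet.decode("iso-8859-1")
# windows-1252 assigns a graphic character to the same octet.
as_cp1252 = octet.decode("windows-1252")

print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
```

The first line reports general category Cc (a control), the second a
named graphic character, which is the difference John's "extension"
versus "variation" argument turns on.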

For iso-8859 it's unclear, or I'm missing a clue better than ECMA 43.

> The only other C0 character that is "processed" by some
> programs is NUL. MSIE simply deletes this(!).

Legacy documents could do something with "ANSI.SYS" control
sequences, trying to set colours can make sense in a static
document.  That's where finding ESC might not be good enough
to guess that it's ISO 2022 as proposed in Richard's article.

Such tricks would typically be used with legacy charsets cp437,
cp850, cp858, or similar, not windows-1252 or iso-8859-1, but
they are all identical for documents limited to US-ASCII, and
Richard's ESC magic could fail if it's only ESC [, i.e. CSI.

> In other charsets, some of these ISO 2022 features are used,
> of course.

Sure, but I assumed a simpler ECMA 43 (ISO 4873) framework for
my points.  Richard's proposal to look for SI and SO to catch
popular ISO 2022 charsets is fine.  ESC is also okay, but then
he had better exclude 7bit CSI, or maybe all 7bit variants of C1.
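
A rough sketch of the check I have in mind, in Python.  This is my
reading of the idea, not the algorithm from Richard's article; the
name looks_like_iso2022 is hypothetical, and treating a lone trailing
ESC as "no evidence" is an arbitrary choice of mine:

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

def looks_like_iso2022(data: bytes) -> bool:
    """Heuristic: True if data contains SO or SI, or an ESC that does
    not start a 7bit C1 variant (ESC 0x40 .. 0x5F, e.g. ESC [ = CSI)."""
    for i, octet in enumerate(data):
        if octet in (SO, SI):
            return True
        if octet == ESC and i + 1 < len(data):
            following = data[i + 1]
            # ESC @ .. ESC _ are 7bit C1 controls, not designations.
            if not 0x40 <= following <= 0x5F:
                return True
    return False

# ANSI.SYS style colour sequences should not trigger the guess ...
print(looks_like_iso2022(b"\x1b[31mred\x1b[0m"))
# ... while an ISO 2022 designation such as ESC $ B should.
print(looks_like_iso2022(b"\x1b$B"))
```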

John wrote:
| The C0 character set of ISO 646 is so entrenched that it does
| not even have its own unique name, and even nigglers like me
| wind up speaking as if US-ASCII had 128 rather than 95 
| characters

Some weeks ago I didn't know that there is a zoo of C0 sets,
and that ISO-IR 1 is what I meant when I said C0 before, with
ISO-IR 6 to get the 94 visible US-ASCII characters.  I'd use
RFC 20 as a normative reference; that takes care of all issues
with (today) obscure variants.

| Synchronous Idle, anyone?

No, thanks, I'm not going to read what the USR Courier manual
says about software flow control again in this millennium.  But
I recall when ACK, NAK, DLE, SYN, etc. used to be relevant.

| otherwise we'd be typing ^W and not ^D or ^Z to signal an end
| of file from the keyboard

Some IBM charsets rotated SUB-FS-DEL, resulting in 0x1A (^Z) FS,
0x1C (^\) DEL, and 0x7F (^?) SUB.  That oddity is still relevant
for ICU and CharmapML, SUB ending up at 0x7F instead of 0x1A.
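
The rotation described above, written out as a table in Python.
This only restates the three assignments from the previous
paragraph; which concrete charset tables apply it is not checked
here, and the names are illustrative:

```python
# Octet in the rotated IBM variant -> usual US-ASCII code point of the
# control that octet stands for (0x1A is FS, 0x1C is DEL, 0x7F is SUB).
IBM_ROTATION = {0x1A: 0x1C, 0x1C: 0x7F, 0x7F: 0x1A}

def to_standard(octet: int) -> int:
    """Map an octet from the rotated variant to the usual code point;
    all other octets pass through unchanged."""
    return IBM_ROTATION.get(octet, octet)

# SUB ends up at 0x7F in the rotated variant, so 0x7F maps back to 0x1A.
print(hex(to_standard(0x7F)))
```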

IBM didn't go so far as to implement it in PC DOS, and AFAIK
nobody else did.  IIRC ^Z is EOF because SUB was used to fill the
last sector of CP/M files, a kind of mandatory padding.  That
degenerated into a single ^Z for MS and PC DOS text files, with
some versions of COMMAND.COM refusing to interpret the last line
of BAT files if there was no EOF.  One tool I still use today
offers to define and do interesting things with EOFIN and EOFOUT,
with defaults suited for old DOS BAT files.
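
For anyone who has to read such files today, a minimal Python sketch
of the text-mode convention: everything from the first ^Z (0x1A) on
is treated as padding.  The helper name strip_dos_eof is made up, and
the sketch assumes the data really is DOS or CP/M style text:

```python
SUB = b"\x1a"  # ^Z, the DOS / CP/M text EOF marker

def strip_dos_eof(data: bytes) -> bytes:
    """Return data up to, but not including, the first ^Z, or all of
    data if there is no ^Z."""
    end = data.find(SUB)
    return data if end == -1 else data[:end]

# A CP/M style file padded with ^Z to the end of its last sector.
padded = b"echo hi\r\n" + SUB * 3
print(strip_dos_eof(padded))
```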

Unix ^D for EOT is a slightly different story, unfortunately not
used on DOS + OS/2 + NT platforms, where ^C or ^Z might work.

 Frank
Received on Sunday, 23 March 2008 20:24:32 UTC