Re: For review: Migrating to Unicode

Frank Ellermann scripsit:

> Naively, if the C1 controls are not bound, the C0 controls also
> are not bound, and I have only ESC assuming an ECMA 43 framework.

In theory, perhaps; in practice, no.  The C0 set of ISO 646, or parts
of it, is in effect by default; no C1 set is.

> For Unicode it's clear, it follows ECMA 48, giving us the normal
> C0 controls including ESC, CR, LF, but also the normal C1 set
> with among others NEL 0x85, and removing the former IND 0x84 -
> the latter was fixed in Unicode some years ago, I hope it will
> be also removed in the net-utf8 RFC before this gets its number.

Actually, Unicode is indifferent to which Cx sets are used with it.
The names of the characters in normal sets are carried in UnicodeData.txt
for convenience, but they aren't normative in Unicode.
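
One way to see this from the outside is Python's unicodedata module, which
reports the control characters as category Cc with no character name at
all.  A minimal sketch; the helper name describe() is made up for
illustration:

```python
import unicodedata

def describe(cp):
    """Return (general category, character name or None) for a code point."""
    ch = chr(cp)
    # name() takes a default that is returned when the character has no name;
    # the C0 and C1 controls fall into that case.
    return unicodedata.category(ch), unicodedata.name(ch, None)

# LF, ESC, the former IND position, and NEL
for cp in (0x0A, 0x1B, 0x84, 0x85):
    print(f"U+{cp:04X}", describe(cp))
```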

> IIRC ^Z is EOF, because SUB was used to fill the last
> sector of CP/M files, a kind of mandatory padding, degenerated
> into one ^Z for MS and PC DOS text files, with some versions of
> COMMAND.COM refusing to interpret the last line of BAT files if
> there was no EOF.  

CP/M picked it up from RT-11, which picked it up from OS/8.  None
of those systems kept exact file lengths around, only length in
disk blocks.  However, filling out the block with ^Zs was just
an application convention -- no more than one was ever needed.
In OS/8, the same convention was used for object code files
as well as text.
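
The convention can be sketched roughly as follows.  This is an
illustration only, not code from any of those systems: the 128-byte
block size is CP/M's record size (the others used different sizes), and
the function names are hypothetical.

```python
SUB = b"\x1a"   # ^Z
BLOCK = 128     # assumed block size: CP/M's 128-byte record

def pad_to_block(text: bytes, block: int = BLOCK) -> bytes:
    """Fill the last block with ^Z, as applications conventionally did.

    Text that already ends on a block boundary gets no padding.
    """
    shortfall = -len(text) % block
    return text + SUB * shortfall

def logical_text(stored: bytes) -> bytes:
    """Recover the text: everything before the first ^Z, if there is one."""
    end = stored.find(SUB)
    return stored if end < 0 else stored[:end]

stored = pad_to_block(b"hello\r\n")
print(len(stored), logical_text(stored))
```

Since the reader stops at the first ^Z, a single one does the job; the
rest of the fill is only habit.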

> Unix ^D for ETX is a slightly different story, unfortunately not
> used on DOS + OS/2 + NT platforms, where ^C or ^Z might work.

In any case, ^Y (EM, logical end of medium) would have been the Right Thing.
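
For reference, caret notation is just the letter's code with bit 0x40
flipped.  A small sketch; the name table is abridged to the C0 controls
that come up in this thread, and caret_to_code() is a made-up helper:

```python
# Abridged C0 names from ISO 646 / ECMA-48
C0_NAMES = {
    0x03: "ETX", 0x04: "EOT", 0x17: "ETB",
    0x19: "EM", 0x1A: "SUB", 0x1B: "ESC",
}

def caret_to_code(caret: str) -> int:
    """Map caret notation such as '^Z' to its control code."""
    if len(caret) != 2 or caret[0] != "^":
        raise ValueError(f"not caret notation: {caret!r}")
    # Flipping bit 0x40 turns '@'..'_' into 0x00..0x1F
    return ord(caret[1].upper()) ^ 0x40

for key in ("^C", "^D", "^W", "^Y", "^Z", "^["):
    code = caret_to_code(key)
    print(key, f"0x{code:02X}", C0_NAMES.get(code, "?"))
```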

-- 
Your worships will perhaps be thinking          John Cowan
that it is an easy thing to blow up a dog?      http://www.ccil.org/~cowan
[Or] to write a book?
    --Don Quixote, Introduction                 cowan@ccil.org

Received on Sunday, 23 March 2008 21:49:19 UTC