Re: For review: Migrating to Unicode from Erik van der Poel on 2008-03-24 (www-international@w3.org from January to March 2008)

From: Erik van der Poel <erikv@google.com>
Date: Mon, 24 Mar 2008 06:51:42 -0700
To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: www-international@w3.org
Message-ID: <c07a32650803240651v24eb5d83qea09dcfa71583211@mail.gmail.com>

Frank,

Instead of looking at ECMA and ISO standards, another way to look at
this is to note that we are having this discussion on a mailing list
@w3.org. We have mentioned "Content-Type". In the context of w3.org,
one of the main standards that uses "Content-Type" is HTTP, i.e. RFC
2616. For charsets, this RFC refers to RFC 1700, which refers to:

ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets/README

This, in turn, takes you to:

http://www.iana.org/assignments/character-sets

For iso-8859-1, the IANA registry cites RFC 1345, which has the
following entry:

  &charset ISO_8859-1:1987
  &rem source: ECMA registry
  &alias iso-ir-100
  &g1esc x2d41 &g2esc x2e41 &g3esc x2f41
  &alias ISO_8859-1
  &alias ISO-8859-1
  &alias latin1
  &alias l1
  &alias IBM819
  &alias CP819
  &code 0
  NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
  DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
  SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
  At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
  '! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
  PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
  DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
  NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
  DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
  A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
  D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
  a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
  d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:

So, in theory, HTTP user agents should use the above C0 and C1 sets,
along with the indicated G1, G2 and G3.
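
To make the table above easier to check, here is a minimal Python sketch
that treats the rows after "&code 0" as a flat list of 256 mnemonics
indexed by byte value. The names TABLE, MNEMONICS, C0 and C1 are made up
for this illustration and are not part of RFC 1345; the rows themselves
are copied verbatim from the entry quoted above.

```python
# Rows following "&code 0" in the RFC 1345 entry for ISO_8859-1:1987,
# copied as quoted above. Each whitespace-separated token is the
# mnemonic for one code position, starting at 0x00.
TABLE = """
NU SH SX EX ET EQ AK BL BS HT LF VT FF CR SO SI
DL D1 D2 D3 D4 NK SY EB CN EM SB EC FS GS RS US
SP ! " Nb DO % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
At A B C D E F G H I J K L M N O P Q R S T U V W X Y Z <( // )> '> _
'! a b c d e f g h i j k l m n o p q r s t u v w x y z (! !! !) '? DT
PA HO BH NH IN NL SA ES HS HJ VS PD PU RI S2 S3
DC P1 P2 TS CC MW SG EG SS GC SC CI ST OC PM AC
NS !I Ct Pd Cu Ye BB SE ': Co -a << NO -- Rg '-
DG +- 2S 3S '' My PI .M ', 1S -o >> 14 12 34 ?I
A! A' A> A? A: AA AE C, E! E' E> E: I! I' I> I:
D- N? O! O' O> O? O: *X O/ U! U' U> U: Y' TH ss
a! a' a> a? a: aa ae c, e! e' e> e: i! i' i> i:
d- n? o! o' o> o? o: -: o/ u! u' u> u: y' th y:
"""

# One mnemonic per byte value, 0x00 through 0xFF.
MNEMONICS = TABLE.split()

# The two control-code areas the entry assigns: C0 and C1.
C0 = MNEMONICS[0x00:0x20]
C1 = MNEMONICS[0x80:0xA0]

if __name__ == "__main__":
    # Show where the three controls that matter most in practice sit,
    # plus NL (next line) from the C1 area.
    for name in ("HT", "LF", "CR", "NL"):
        print(f"{name} is at 0x{MNEMONICS.index(name):02X}")
```

Read this way, HT, LF and CR fall in the C0 area, and the RFC 1345 entry
does assign mnemonics to all of 0x80 through 0x9F.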

In practice, few HTTP user agents do anything meaningful with all of
those. As I said, CR, LF and HT are the most important. And C1 is
ignored completely: bytes 0x80-0x9F are treated as windows-1252 instead.
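
The difference between theory and practice can be seen with a short
Python sketch that decodes the same bytes both ways. It relies only on
Python's built-in "iso-8859-1" and "windows-1252" codecs; the sample
bytes and variable names are chosen for illustration, and this is not a
claim about how any particular user agent is implemented.

```python
# Sample bytes: "A", then 0x80 and 0x85 (C1 positions in iso-8859-1),
# then CR LF.
raw = bytes([0x41, 0x80, 0x85, 0x0D, 0x0A])

# By the book: iso-8859-1 maps every byte to the code point with the
# same number, so 0x80 and 0x85 become C1 controls (0x85 is NEL).
by_the_book = raw.decode("iso-8859-1")

# What the message describes as common practice: the C1 range is read
# as windows-1252, where 0x80 and 0x85 are printable characters.
in_practice = raw.decode("windows-1252")

if __name__ == "__main__":
    print([f"U+{ord(ch):04X}" for ch in by_the_book])
    print([f"U+{ord(ch):04X}" for ch in in_practice])
```

Only the two bytes in the 0x80-0x9F range come out differently; "A", CR
and LF decode identically under both labels.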

There is no point living in ECMA/ISO/RFC theory land. I prefer to live
in the real world, and test the popular applications to see what they
actually do. I also look at data to see what is actually used.

Erik

On Mon, Mar 24, 2008 at 5:09 AM, Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:
>
>  John Cowan wrote:
>
>   [C0 vs. C1 in iso-8859-1]
>
>  > In theory, perhaps; in practice, no.  The C0 set of ISO 646,
>  > or parts of it, are by default in effect; no C1 set is.
>
>  Okay, I know that I can use CRLF in iso-8859-1 among others in
>  practice, but I'd expect at least a hint about this practical
>  default also in the standard.  Trying to implement this with
>  an explicit ESC ! @ likely won't work as expected in practice.
>
>  On the page <http://www.itscj.ipsj.or.jp/ISO-IR/2-5.htm> four
>  different C0 sets claim to be related to ISO 646.
>
>
>  > Unicode is indifferent to which Cx sets are used with it.
>  > The names of the characters in normal sets are carried in
>  > UnicodeData.txt for convenience, but they aren't normative
>  > in Unicode.
>
>  The book says that I may assume ECMA 48 (ISO 6429), and in
>  table 16.1 it claims that 10 control codes are "specified".
>  I don't know what this means, it's followed by a discussion
>  of u+0000 not belonging to the ten "specified" control codes,
>  but in any case NEL u+0085 is "specified" (= one of the ten).
>
>
>  > filling out the block with ^Zs was just an application
>  > convention -- no more than one was ever needed.  In OS/8,
>  > the same convention was used for object code files as well
>  > as text.
>
>  I fear I missed OS/8, the oldest platforms I recall are /360,
>  TOPS/10, BS2000, and TR 440.  For the use of 0xF0 by format
>  tools I guess it is an urban legend that it is derived from
>  EBCDIC "V" = "virgin".
>
>
>  > ^W (logical end of medium) would have been the Right Thing.
>
>  For some uses of ^A .. ^Z such as Martin's example ^S they
>  could be mnemonics, S = suspend (XOFF, therefore Q = XON),
>  Z = last letter (therefore eof), R = reprint.
>
>  One year after <http://www.w3.org/People/cmsmcq/2007/C1.xml>
>  all this appears to be still as messy as twenty years ago :-(
>  But in RFC 20 almost 40 years ago it was still fine.
>
>   Frank
>
>
>
Received on Monday, 24 March 2008 13:52:21 UTC