
[Moderator Action] Re: BOM & Unicode editors

From: Michael (michka) Kaplan <michka@trigeminal.com>
Date: Tue, 06 Jun 2000 15:01:59 +0900
Message-Id: <4.2.0.58.J.20000606150150.03311590@sh.w3.mag.keio.ac.jp>
To: www-international@w3.org

I think that the move is a very good thing. We need standards like this. :-)

Unfortunately, since Microsoft does not currently support any byte order
other than Little Endian, you probably need to know the Microsoft
terminology if you want to work with Windows 2000....

 >ANSI (actually it means MBCS using the system default code page)
 >Unicode (Little Endian, actually it means UCS-2)
 >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 >processors, etc.)
 >UTF-8

There has long been controversy over the fact that MS products use "Unicode"
to mean UCS-2 and consider UTF-8 to be a multibyte encoding. I think it
mostly stems from the fact that the Unicode APIs of NT have always supported
UCS-2. The discrepancy is compounded by several problems:

1) Most other (non-MS) products and OSes usually mean UTF-8 when they
refer to Unicode.
2) Many Microsoft products like IIS and FrontPage do not handle Unicode
files at all.
3) Products like IIS handle UTF-8 only in the 5.0 version.
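
To illustrate why the byte order mark matters to a program like Notepad, here
is a minimal Python sketch of sniffing the three BOMs discussed in this thread
(UTF-8, UTF-16 little endian, UTF-16 big endian). The function names and the
cp1252 fallback, which stands in for "the system default code page", are
assumptions made for illustration; this is not a description of how Notepad
itself is implemented.

```python
import codecs
import sys

# BOM byte sequences for the three "Unicode" save formats discussed above.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data, fallback="cp1252"):
    """Return (codec_name, bom_length) based on the leading bytes.

    The fallback code page is an assumption; a real tool would ask the
    operating system for its default code page instead.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return fallback, 0


def decode_text(data, fallback="cp1252"):
    """Strip any recognized BOM and decode the remaining bytes."""
    name, skip = sniff_bom(data, fallback)
    return data[skip:].decode(name)


def is_opposite_endian(codec_name):
    """True if a UTF-16 stream's byte order differs from this machine's."""
    if not codec_name.startswith("utf-16"):
        return False  # byte order is not a concern for UTF-8 or MBCS data
    file_order = "little" if codec_name.endswith("-le") else "big"
    return file_order != sys.byteorder
```

Without a BOM, the reader falls through to the fallback code page, which is
exactly the guessing that the mark lets a plain-text editor avoid.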


michka


----- Original Message -----
From: "Asmus Freytag" <asmusf@ix.netcom.com>
To: "Michael (michka) Kaplan" <michka@trigeminal.com>; "Martin J. Duerst"
<duerst@w3.org>
Sent: Thursday, May 11, 2000 4:13 PM
Subject: Re: BOM & Unicode editors


 > We are now moving (at least within Unicode) to a consistent terminology
 >
 > UTF-8
 > UTF-16 (Endianness dependent, usually uses BOM)
 > UTF-16BE (known to be big endian, no BOM)
 > UTF-16LE (known to be little endian, no BOM)
 > UTF-32 (restricted to codes 0000-10FFFF)
 >
 > For the generic UTF-16 there is one logical designation, two physical
 > manifestations of opposite byte order. Unfortunately there is no term
 > for the actual physical representation, since the two other terms not
 > only designate a specific byte order, but also imply the absence of a
 > BOM character - furthermore, when you are actually processing
 > the data, the endianness of interest is not so much whether it's little
 > endian or big endian, but rather whether it's same endian or opposite
 > endian.
 >
 > At 02:24 PM 5/11/00 +0900, you wrote:
 > >In Windows 2000 notepad, the option to save your files as any of the
 > >following four formats exists:
 > >
 > >ANSI (actually it means MBCS using the system default code page)
 > >Unicode Little Endian (actually it means UCS-2)
 > >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 > >processors, etc.)
 > >UTF-8
 > >
 > >The latter three do indeed contain byte order marks, if for no other
 > >reason than reopening the file allows notepad to read it properly and
 > >not guess about the encoding.
 > >
 > >FrontPage 2000 does not support the middle two, but it supports any
 > >supported MBCS code page on the system and UTF-8... with no byte order
 > >mark required. Instead, it marks the encoding by other means.
 > >
 > >But for a program like notepad, the ability to open the file, save it,
 > >and re-open it pretty much requires the byte mark.
 > >
 > >michka
 > >
 > >
 > >----- Original Message -----
 > >From: "Chris Lilley" <chris@w3.org>
 > >To: "Asmus Freytag" <asmusf@ix.netcom.com>
 > >Cc: "Saba Sundaramurthy" <ssundaramurthy@verisign.com>;
 > ><mozilla-i18n@mozilla.org>; <www-international@w3.org>;
 > ><i18n-prog@acoin.com>
 > >Sent: Wednesday, May 10, 2000 1:43 AM
 > >Subject: Re: BOM & Unicode editors
 > >
 > >
 > > >
 > > >
 > > > Asmus Freytag wrote:
 > > > >
 > > > > At 04:55 PM 5/9/00 -0700, Saba Sundaramurthy wrote:
 > > > > >     Is this something all editors that save files in Unicode or
 > > > > >UTF-8 are required to do? Can I depend on the presence of this
 > > > > >marker in my code?
 > > > >
 > > > > No, it's not a requirement, but it's a convention followed by quite
 > > > > a few tools, because otherwise it's harder to use the same .txt
 > > > > extension for both ASCII and Unicode (and also it helps to mark the
 > > > > byte order, of course).
 > > >
 > > > This is all fine and well for UTF-16, but what about UTF-8? Why does
 > > > the byte order matter?
 > > >
 > > > > I would recommend that you look for it in your code, if you plan to
 > > > > read UTF-16 files.
 > > >
 > > > And for UTF-8 files?
 > > >
 > > > --
 > > > Chris
 > > > /* the i18n-prog homepage is at:               */
 > > > /* http://www.acoin.com/i18n/i18n-prog.htm     */
 > > > /* See the page for removal instructions, etc. */
 > > >
 >
Received on Tuesday, 6 June 2000 01:55:00 GMT
