
Re: [Moderator Action] Re: BOM & Unicode editors

From: Michael (michka) Kaplan <michka@trigeminal.com>
Date: Mon, 11 Sep 2000 16:16:41 +0900
Message-Id: <4.2.0.58.J.20000911161626.034d4760@sh.w3.mag.keio.ac.jp>
To: www-international@w3.org
MS FrontPage 2000 support of UTF-8 is what keeps my site up and running! How
else to support Tamil/Hindi/Georgian/Armenian? :-)

Don't forget Microsoft Jet, whose text IISAM will support UTF-7, UTF-8,
UTF-16LE, and UTF-16BE. :-)

michka
(who is a huge fan of Jet and its text IISAM, which is very useful in his
work!)


----- Original Message -----
From: "Chris Pratley" <chrispr@MICROSOFT.com>
To: "'Michael (michka) Kaplan'" <michka@trigeminal.com>;
<www-international@w3.org>
Sent: Tuesday, June 06, 2000 6:50 PM
Subject: RE: [Moderator Action] Re: BOM & Unicode editors


 > Michael is a little overzealous in dismissing our support of various forms
 > of Unicode:
 > 1. Both Notepad on Win2000 and Word2000 on any system support input/output
 > as Big-Endian UTF-16 plain text (with BOM).
 > 2. FrontPage (2000 and perhaps 98) allow open/save of HTML files as UTF-8.
 >
 > Chris Pratley
 > Group Program Manager
 > Microsoft Word
 >
 > -----Original Message-----
 > From: Michael (michka) Kaplan [mailto:michka@trigeminal.com]
 > Sent: June 5, 2000 11:02 PM
 > To: www-international@w3.org
 > Subject: [Moderator Action] Re: BOM & Unicode editors
 >
 > I think that the move is a very good thing. We need standards like this. :-)
 >
 >
 >
 > Unfortunately, since Microsoft does not currently support any endian system
 >
 > other than Little Endian, you probably need to know the Microsoft one if you
 >
 > want to work with Windows 2000....
 >
 >
 >
 > >ANSI (actually it means MBCS using the system default code page)
 >
 > >Unicode (Little Endian, actually it means UCS-2)
 >
 > >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 >
 > >processors, etc.)
 >
 > >UTF-8
 >
 >
 >
 > There has long been controversy over the fact that MS products use "Unicode"
 >
 > to mean UCS-2 and consider UTF-8 to be a multibyte encoding. I think it
 >
 > mostly stems from the fact that the Unicode APIs of NT have always supported
 >
 > UCS-2. The discrepancy is compounded by several problems:
 >
 >
 >
 > 1) Most other (non-MS) products and OSes usually mean to use UTF-8 when
 >
 > referring to Unicode.
 >
 > 2) Many Microsoft products like IIS and FrontPage do not handle Unicode
 >
 > files at all
 >
 > 3) Products like IIS handle UTF-8 only in the 5.0 version
 >
 >
 >
 >
 >
 > michka
 >
 >
 >
 >
 >
 > ----- Original Message -----
 >
 > From: "Asmus Freytag" <asmusf@ix.netcom.com>
 >
 > To: "Michael (michka) Kaplan" <michka@trigeminal.com>; "Martin J. Duerst"
 >
 > <duerst@w3.org>
 >
 > Sent: Thursday, May 11, 2000 4:13 PM
 >
 > Subject: Re: BOM & Unicode editors
 >
 >
 >
 >
 >
 > > We are now moving (at least within Unicode) to a consistent terminology
 >
 > >
 >
 > > UTF-8
 >
 > > UTF-16 (Endianness dependent, usually uses BOM)
 >
 > > UTF-16BE (known to be big endian, no BOM)
 >
 > > UTF-16LE (known to be little endian, no BOM)
 >
 > > UTF-32 (restricted to codes 0000-10FFFF)
 >
 > >
 >
 > > For the generic UTF-16 there is one logical designation, two physical
 >
 > > manifestations of opposite byte order. Unfortunately there is no term
 >
 > > for the actual physical representation, since the two other terms not
 >
 > > only designate a specific byte order, but also imply the absence of a
 >
 > > BOM character - furthermore, when you are actually processing
 >
 > > the data, the endianness of interest is not so much whether it's little
 >
 > > endian or big endian, but rather whether it's same endian or opposite
 >
 > > endian.
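
The terminology above can be illustrated with a short sketch. This is only an illustration using Python's standard codecs, not anything from the original messages; the sample string and the helper name `is_same_endian` are assumptions chosen for the example. It shows that the BE/LE labels fix the byte order and write no BOM, while the generic label writes a BOM followed by data in the machine's native order.

```python
import codecs
import sys

text = "A\u20ac"  # "A" plus the euro sign, U+20AC (arbitrary sample text)

# UTF-16BE / UTF-16LE: byte order is fixed by the label and no BOM is written.
be = text.encode("utf-16-be")
le = text.encode("utf-16-le")
print(be.hex(" "))  # 00 41 20 ac
print(le.hex(" "))  # 41 00 ac 20

# Generic UTF-16: Python's encoder writes a BOM, then the data in the
# machine's native byte order, so the bytes differ between platforms.
generic = text.encode("utf-16")
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
assert generic == codecs.BOM_UTF16 + text.encode(native)


def is_same_endian(data: bytes) -> bool:
    """Hypothetical helper: True if data starts with a UTF-16 BOM in this
    machine's own byte order ("same endian" rather than "opposite endian")."""
    return data.startswith(codecs.BOM_UTF16)


# Decoding with the generic label consumes the BOM and accepts either order.
print((codecs.BOM_UTF16_BE + be).decode("utf-16") == text)  # True
print((codecs.BOM_UTF16_LE + le).decode("utf-16") == text)  # True
```

Note that decoding BOM-prefixed data with one of the fixed-order labels does not strip the BOM; it comes back as a leading U+FEFF character.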
 >
 > >
 >
 > > At 02:24 PM 5/11/00 +0900, you wrote:
 >
 > > >In Windows 2000 notepad, the option to save your files as any of the
 >
 > > >following four formats exists:
 >
 > > >
 >
 > > >ANSI (actually it means MBCS using the system default code page)
 >
 > > >Unicode Little Endian (actually it means UCS-2)
 >
 > > >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 >
 > > >processors, etc.)
 >
 > > >UTF-8
 >
 > > >
 >
 > > >The latter three do indeed contain byte order marks, if for no other reason
 >
 > > >than reopening the file allows notepad to read it properly and not guess
 >
 > > >about the encoding.
 >
 > > >
 >
 > > >FrontPage 2000 does not support the middle two, but it supports any
 >
 > > >supported MBCS code page on the system and UTF-8... with no byte mark
 >
 > > >required. But they mark encoding with other means.
 >
 > > >
 >
 > > >But for a program like notepad, the ability to open the file, save it, and
 >
 > > >re-open it pretty much requires the byte mark.
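
The open, save, and re-open round trip described above can be sketched as BOM sniffing. This is a minimal sketch and not a description of how Notepad is actually implemented: the function names are hypothetical, and `cp1252` is an assumed stand-in for the "ANSI" system default code page. UTF-32 BOMs are deliberately left out because the thread lists only these formats.

```python
import codecs

# BOMs for the three marked formats discussed in the thread. UTF-32 is
# omitted on purpose; its little-endian BOM begins with the UTF-16LE BOM,
# so a fuller detector would have to test for UTF-32 first.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode by BOM when present; otherwise fall back to an assumed
    legacy code page (the "ANSI" case)."""
    name, skip = sniff_bom(data)
    if name is None:
        return data.decode(fallback)
    return data[skip:].decode(name)


saved = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_bom(saved))        # ('utf-16-be', 2)
print(decode_with_bom(saved))  # hi
```

Without the BOM, the reader has to fall back to a guess, which is the situation the byte order mark is meant to avoid.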
 >
 > > >
 >
 > > >michka
 >
 > > >
 >
 > > >
 >
 > > >----- Original Message -----
 >
 > > >From: "Chris Lilley" <chris@w3.org>
 >
 > > >To: "Asmus Freytag" <asmusf@ix.netcom.com>
 >
 > > >Cc: "Saba Sundaramurthy" <ssundaramurthy@verisign.com>;
 >
 > > ><mozilla-i18n@mozilla.org>; <www-international@w3.org>;
 >
 > > ><i18n-prog@acoin.com>
 >
 > > >Sent: Wednesday, May 10, 2000 1:43 AM
 >
 > > >Subject: Re: BOM & Unicode editors
 >
 > > >
 >
 > > >
 >
 > > > >
 >
 > > > >
 >
 > > > > Asmus Freytag wrote:
 >
 > > > > >
 >
 > > > > > At 04:55 PM 5/9/00 -0700, Saba Sundaramurthy wrote:
 >
 > > > > > > Is this something all editors that save files in Unicode or UTF-8 are
 >
 > > > > > >required to do? Can I depend on the presence of this marker in my code?
 >
 > > > > >
 >
 > > > > > No, it's not a requirement, but it's a convention followed by quite a few tools,
 >
 > > > > > because otherwise it's harder to use the same .txt extension for both ASCII and
 >
 > > > > > Unicode (and also it helps to mark the byte order, of course).
 >
 > > > >
 >
 > > > > This is all fine and well for UTF-16, but what about UTF-8? Why does the
 >
 > > > > byte order matter?
 >
 > > > >
 >
 > > > > > I would recommend that you look for it in your code, if you plan to read UTF-16
 >
 > > > > > files.
 >
 > > > >
 >
 > > > > And for UTF-8 files?
 >
 > > > >
 >
 > > > > --
 >
 > > > > Chris
 >
 > > > > /* the i18n-prog homepage is at: */
 >
 > > > > /* http://www.acoin.com/i18n/i18n-prog.htm */
 >
 > > > > /* See the page for removal instructions, etc. */
 >
 > > > >
 >
 > >
 >
 >
Received on Monday, 11 September 2000 03:55:29 GMT
