RE: [Moderator Action] Re: BOM & Unicode editors

Michael is a little overzealous in dismissing our support of various forms
of Unicode:
1. Both Notepad on Win2000 and Word2000 on any system support input/output
as Big-Endian UTF-16 plain text (with BOM).
2. FrontPage (2000 and perhaps 98) allows open/save of HTML files as UTF-8.

Chris Pratley
Group Program Manager
Microsoft Word

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@trigeminal.com]
Sent: June 5, 2000 11:02 PM
To: www-international@w3.org
Subject: [Moderator Action] Re: BOM & Unicode editors

I think that the move is a very good thing. We need standards like this. :-)

Unfortunately, since Microsoft does not currently support any endian system
other than Little Endian, you probably need to know the Microsoft one if you
want to work with Windows 2000....

 >ANSI (actually it means MBCS using the system default code page)
 >Unicode (Little Endian, actually it means UCS-2)
 >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 >processors, etc.)
 >UTF-8

There has long been controversy over the fact that MS products use "Unicode"
to mean UCS-2 and consider UTF-8 to be a multibyte encoding. I think it
mostly stems from the fact that the Unicode APIs of NT have always supported
UCS-2. The discrepancy is compounded by several problems:

1) Most other (non-MS) products and OSes usually mean UTF-8 when
referring to Unicode.
2) Many Microsoft products like IIS and FrontPage do not handle Unicode
files at all
3) Products like IIS handle UTF-8 only in the 5.0 version

michka

----- Original Message -----
From: "Asmus Freytag" <asmusf@ix.netcom.com>
To: "Michael (michka) Kaplan" <michka@trigeminal.com>; "Martin J. Duerst"
<duerst@w3.org>
Sent: Thursday, May 11, 2000 4:13 PM
Subject: Re: BOM & Unicode editors

 > We are now moving (at least within Unicode) to a consistent terminology
 >
 > UTF-8
 > UTF-16 (Endianness dependent, usually uses BOM)
 > UTF-16BE (known to be big endian, no BOM)
 > UTF-16LE (known to be little endian, no BOM)
 > UTF-32 (restricted to codes 0000-10FFFF)
 >
 > For the generic UTF-16 there is one logical designation, two physical
 > manifestations of opposite byte order. Unfortunately there is no term
 > for the actual physical representation, since the two other terms not
 > only designate a specific byte order, but also imply the absence of a
 > BOM character - furthermore, when you are actually processing
 > the data, the endianness of interest is not so much whether it's little
 > endian or big endian, but rather whether it's same endian or opposite
 > endian.
 >
 > At 02:24 PM 5/11/00 +0900, you wrote:
 > >In Windows 2000 notepad, the option to save your files as any of the
 > >following four formats exists:
 > >
 > >ANSI (actually it means MBCS using the system default code page)
 > >Unicode Little Endian (actually it means UCS-2)
 > >Unicode Big Endian (also means UCS-2, I believe? At least for RISC
 > >processors, etc.)
 > >UTF-8
 > >
 > >The latter three do indeed contain byte order marks, if for no other reason
 > >than reopening the file allows notepad to read it properly and not guess
 > >about the encoding.
 > >
 > >FrontPage 2000 does not support the middle two, but it supports any
 > >supported MBCS code page on the system and UTF-8... with no byte mark
 > >required. But they mark encoding with other means.
 > >
 > >But for a program like notepad, the ability to open the file, save it, and
 > >re-open it pretty much requires the byte mark.
 > >
 > >michka
 > >
 > >
 > >----- Original Message -----
 > >From: "Chris Lilley" <chris@w3.org>
 > >To: "Asmus Freytag" <asmusf@ix.netcom.com>
 > >Cc: "Saba Sundaramurthy" <ssundaramurthy@verisign.com>;
 > ><mozilla-i18n@mozilla.org>; <www-international@w3.org>;
 > ><i18n-prog@acoin.com>
 > >Sent: Wednesday, May 10, 2000 1:43 AM
 > >Subject: Re: BOM & Unicode editors
 > >
 > >
 > > >
 > > >
 > > > Asmus Freytag wrote:
 > > > >
 > > > > At 04:55 PM 5/9/00 -0700, Saba Sundaramurthy wrote:
 > > > > >     Is this something all editors that save files in Unicode or
 > > > > >UTF-8 are required to do? Can I depend on the presence of this
 > > > > >marker in my code?
 > > > >
 > > > > No, it's not a requirement, but it's a convention followed by quite
 > > > > a few tools, because otherwise it's harder to use the same .txt
 > > > > extension for both ASCII and Unicode (and also it helps to mark the
 > > > > byte order, of course).
 > > >
 > > > This is all fine and well for UTF-16, but what about UTF-8? Why does
 > > > the byte order matter?
 > > >
 > > > > I would recommend that you look for it in your code, if you plan to
 > > > > read UTF-16 files.
 > > >
 > > > And for UTF-8 files?
 > > >
 > > > --
 > > > Chris
 > > > /* the i18n-prog homepage is at:               */
 > > > /* http://www.acoin.com/i18n/i18n-prog.htm     */
 > > > /* See the page for removal instructions, etc. */
 > > >
 >

Received on Tuesday, 6 June 2000 21:51:27 UTC
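
As a rough illustration of the BOM convention discussed in this thread, the
Python sketch below checks the first bytes of a file for a UTF-8 or UTF-16
byte order mark and falls back to a legacy code page when none is found. The
names sniff_bom and decode_text and the cp1252 fallback are assumptions made
for this example, not anything the posters specified, and UTF-32 marks are
deliberately left out.

```python
import codecs

# Byte order marks for the three BOM-carrying formats named in the thread.
# Assumption: UTF-32 is out of scope, so a UTF-32LE file (which also starts
# with FF FE) would be misread as UTF-16LE by this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if there is one, else a hypothetical fallback."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # No marker: the reader has to guess or be told the encoding.
        return data.decode(fallback)
    # Drop the BOM so it does not show up as U+FEFF in the decoded text.
    return data[skip:].decode(encoding)
```

A file without a BOM forces the reader into the fallback branch, which is the
guessing that the marker is meant to avoid.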