Re: BOM & Unicode editors from Martin J. Duerst on 2000-05-12 (www-international@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 12 May 2000 18:35:54 +0900
To: Saba Sundaramurthy <ssundaramurthy@verisign.com>, mozilla-i18n@mozilla.org, www-international@w3.org, i18n-prog@acoin.com
Message-Id: <4.2.0.58.J.20000512181006.00c96a80@sh.w3.mag.keio.ac.jp>

Hello Saba,

For some more information on UTF-8, please see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.

There are some errors in the slide on page 5, but
they are not very relevant here.

The paper in particular shows how easy it is to automatically
detect UTF-8 based on its specific byte patterns. This can
mostly be done on the fly, i.e. a decoder starts with the
assumption that it is reading only ASCII, and decides whether the
input is in the local legacy encoding or in UTF-8 once the first
bytes with the 8th bit set are seen.
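
As a rough illustration of the byte-pattern check described above, here is
a minimal Python sketch. The function name guess_encoding and the default
legacy encoding "iso-8859-1" are assumptions made for this example, not
something taken from the paper. The check is also simplified: it looks only
at lead-byte ranges and continuation bytes, and does not reject every
ill-formed sequence (some overlong forms and surrogates pass).

```python
def guess_encoding(data: bytes, legacy: str = "iso-8859-1") -> str:
    """Guess 'ascii', 'utf-8', or the (assumed) local legacy encoding.

    Simplified sketch: validates lead-byte ranges and continuation bytes
    only; it is not a full UTF-8 validator.
    """
    i, n = 0, len(data)
    seen_high_bit = False
    while i < n:
        b = data[i]
        if b < 0x80:              # plain ASCII byte, nothing to decide yet
            i += 1
            continue
        seen_high_bit = True
        if 0xC2 <= b <= 0xDF:     # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:   # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4:   # lead byte of a 4-byte sequence
            need = 3
        else:                     # stray continuation byte or invalid lead
            return legacy
        trail = data[i + 1:i + 1 + need]
        # Every continuation byte must have the bit pattern 10xxxxxx.
        if len(trail) != need or any((t & 0xC0) != 0x80 for t in trail):
            return legacy
        i += 1 + need
    return "utf-8" if seen_high_bit else "ascii"
```

For example, guess_encoding("héllo".encode("utf-8")) yields "utf-8", while
the same text encoded as ISO-8859-1 falls back to the legacy guess, because
the byte 0xE9 is not followed by two continuation bytes. This sketch scans
the whole buffer; a streaming decoder could commit to a decision earlier.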

One big problem with using the BOM as a 'magic number' for UTF-8
should also be mentioned here:

UTF-8 without a BOM has the very important property that it
encodes ASCII as ASCII, and everything else as something else.
An ASCII file therefore is automatically UTF-8. All the nice
things that you can do with text files can be done with UTF-8,
too. However, once there is a BOM on a file, an ASCII file is
no longer ASCII, and very simple operations such as a Unix
'cat' fail.
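
To make the point concrete, here is a small Python sketch. The helper names
is_ascii and cat are made up for this example, and cat only imitates the
byte-wise concatenation that the Unix tool performs. Reading "fail" as
"the output contains a stray BOM in mid-stream" is an interpretation, not
something spelled out above.

```python
import codecs

ascii_text = b"hello\n"
with_bom = codecs.BOM_UTF8 + ascii_text   # EF BB BF followed by the ASCII bytes


def is_ascii(data: bytes) -> bool:
    """True if every byte has the 8th bit clear."""
    return all(b < 0x80 for b in data)


def cat(*files: bytes) -> bytes:
    """Byte-wise concatenation, imitating what Unix 'cat' does."""
    return b"".join(files)


# Without a BOM, the ASCII file is also valid UTF-8, byte for byte.
plain_is_ascii = is_ascii(ascii_text)
plain_roundtrip = ascii_text.decode("utf-8").encode("ascii") == ascii_text

# With a BOM, the same text is no longer an ASCII file.
bom_is_ascii = is_ascii(with_bom)

# Concatenating two BOM-prefixed files leaves a BOM in the middle of the
# result, where it no longer acts as a signature.
joined = cat(with_bom, with_bom)
second_bom_offset = joined.find(codecs.BOM_UTF8, 1)
```

Here second_bom_offset points into the middle of the concatenated data, so
any consumer has to know to skip U+FEFF there as well.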

Regards,   Martin.

At 00/05/09 16:55 -0700, Saba Sundaramurthy wrote:
>Hi,
>
>1)    Playing with text editors (FrontPage 2000 and Notepad) in Windows NT
>and Windows 2000, I noticed that whenever the contents are saved as Unicode or
>UTF-8 there is a marker FEFF placed at the beginning of the file. Inspecting
>this marker can give information about the byte ordering of the machine and
>also whether the following bytes are Unicode or UTF-8.
>
>     Is this something all editors that save files in Unicode or UTF-8 are
>required to do? Can I depend on the presence of this marker in my code?
>
>2)      Are there any editors available on Unix that allow you to save text in
>Unicode or UTF-8?
>
>Thanks in advance,
>-Saba

Received on Friday, 12 May 2000 05:31:29 UTC