
Re: BOM & Unicode editors

From: Yung-Fong Tang <ftang@netscape.com>
Date: Sat, 13 May 2000 13:40:08 -0700
Message-ID: <391DBDA7.751114A2@netscape.com>
To: "Martin J. Duerst" <duerst@w3.org>
CC: Saba Sundaramurthy <ssundaramurthy@verisign.com>, mozilla-i18n@mozilla.org, www-international@w3.org, i18n-prog@acoin.com
Also, I have a UTF-8 validator. You can upload a file to see whether it is UTF-8 or not.
See http://people.netscape.com/ftang/i18n.html
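
A check of this kind can be sketched in a few lines of Python. This is only
an illustration, not the code behind the page above: the function name
classify and its three return labels are made up for the example, and it
leans on Python's strict UTF-8 decoder instead of inspecting byte patterns
by hand. It follows the approach described in the quoted message below:
treat the data as ASCII until a byte with the 8th bit set appears, then
decide between UTF-8 and something else.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a byte string.

    Hypothetical helper for illustration only.
    """
    # No byte has the 8th bit set: plain ASCII, which is also valid UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Most likely a legacy 8-bit encoding such as ISO-8859-1.
        return "other"
    return "utf-8"


if __name__ == "__main__":
    print(classify(b"hello"))
    print(classify("h\u00e9llo".encode("utf-8")))
    print(classify("h\u00e9llo".encode("latin-1")))
```

Short legacy-encoded inputs can occasionally pass as UTF-8 by accident, so
the answer "utf-8" is a strong hint, not a proof.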

"Martin J. Duerst" wrote:

> Hello Saba,
>
> For some more information on UTF-8, please see
> http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
>
> There are some errors in the slide on page 5, but
> they are not very relevant here.
>
> The paper in particular shows how easy it is to automatically
> detect UTF-8 based on its specific byte patterns. This can
> mostly be done on the fly, i.e. a decoder starts with the
> assumption that it reads only ASCII and decides whether it's
> the local legacy encoding or UTF-8 once the first bytes
> with the 8th bit set are seen.
>
> One big problem with using the BOM as a 'magic number' for UTF-8
> also shouldn't go unmentioned here:
>
> UTF-8 without a BOM has the very important property that it
> encodes ASCII as ASCII, and everything else as something else.
> An ASCII file therefore is automatically UTF-8. All the nice
> things that you can do with text files can be done with UTF-8,
> too. However, once there is a BOM on a file, an ASCII file is
> no longer ASCII, and very simple operations such as a Unix
> 'cat' fail.
>
> Regards,   Martin.
>
> At 00/05/09 16:55 -0700, Saba Sundaramurthy wrote:
> >Hi,
> >
> >1)    Playing with text editors (FrontPage 2000 and Notepad) in Windows NT
> >and Windows 2000, I noticed that whenever the contents are saved as Unicode
> >or UTF-8, a marker FEFF is placed at the beginning of the file. Inspecting
> >this marker can give information about the byte ordering of the machine and
> >also whether the following bytes are Unicode or UTF-8.
> >
> >     Is this something all editors that save files in Unicode or UTF-8 are
> >required to do? Can I depend on the presence of this marker in my code?
> >
> >2)      Are there any editors available on Unix that allow you to save text in
> >Unicode or UTF-8?
> >
> >Thanks in advance,
> >-Saba
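
On the question of depending on the marker: code can look for it, but
should be prepared for it to be missing. The following Python sketch shows
one way to do that. The function name sniff_bom is invented for the
example, and UTF-32 is deliberately left out to keep it short. In UTF-8
the character U+FEFF appears as the three bytes EF BB BF; in UTF-16 it is
FE FF or FF FE depending on byte order.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found.

    Hypothetical helper for illustration only; UTF-32 is not handled.
    """
    candidates = (
        (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
        (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
        (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    )
    for bom, name in candidates:
        if data.startswith(bom):
            return name, len(bom)
    # No marker: the caller has to fall back to other detection.
    return None, 0


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + b"hello"
    name, skip = sniff_bom(with_bom)
    print(name, with_bom[skip:])
    # The same text with a BOM in front is no longer pure ASCII.
    print(all(b < 0x80 for b in with_bom))
```

When sniff_bom returns (None, 0), the data may still be UTF-8, so a caller
would normally fall back to checking the byte patterns themselves.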
Received on Saturday, 13 May 2000 16:40:19 GMT
