- From: Martin J. Duerst <duerst@w3.org>
- Date: Fri, 12 May 2000 18:35:54 +0900
- To: Saba Sundaramurthy <ssundaramurthy@verisign.com>, mozilla-i18n@mozilla.org, www-international@w3.org, i18n-prog@acoin.com
Hello Saba,

For some more information on UTF-8, please see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
There are some errors in the slide on page 5, but they are
not very relevant here.

The paper in particular shows how easy it is to automatically
detect UTF-8 based on its specific byte patterns. This can mostly
be done on the fly, i.e. a decoder starts with the assumption
that it reads only ASCII and decides whether it's the local
legacy encoding or UTF-8 once the first bytes with the 8th bit
set are seen.

One big problem of using the BOM as a 'magic number' for UTF-8
also shouldn't go unmentioned here: UTF-8 without a BOM has
the very important property that it encodes ASCII as ASCII, and
everything else as something else. An ASCII file therefore is
automatically UTF-8. All the nice things that you can do with
text files can be done with UTF-8, too. However, once there
is a BOM on a file, an ASCII file is no longer ASCII, and
very simple operations such as a Unix 'cat' fail.

Regards, Martin.

At 00/05/09 16:55 -0700, Saba Sundaramurthy wrote:
>Hi,
>
>1) Playing with text editors (FrontPage 2000 and Notepad) in Windows NT
>and Windows 2000, I noticed that whenever the contents are saved unicode or
>UTF-8 there is a marker FEFF placed at the beginning of the file. Inspecting
>this marker can give information about the byte ordering of the machine and
>also if the following bytes are Unicode or UTF-8.
>
> Is this something all editors that save files in Unicode or UTF-8 are
>required to do? Can I depend on the presence of this marker in my code?
>
>2) Are there any editors available on unix to allow you to save text in
>Unicode or UTF-8?
>
>Thanks in advance,
>-Saba
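
As a rough illustration of the detection idea in the reply above (treat the data as ASCII until a byte with the 8th bit set appears, then decide between UTF-8 and the local legacy encoding), here is a minimal Python sketch. It is not taken from the paper: the function name `guess_encoding` and the default legacy encoding `iso-8859-1` are assumptions made for the example, and it checks a whole buffer at once rather than streaming, leaning on Python's built-in UTF-8 decoder to validate the byte patterns.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def guess_encoding(data: bytes, legacy: str = "iso-8859-1") -> str:
    """Guess 'ascii', 'utf-8', or the (assumed) local legacy encoding.

    Hypothetical helper for illustration only.
    """
    # No byte has the 8th bit set: pure ASCII, which is also valid UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    # High-bit bytes are present: accept UTF-8 only if every sequence
    # follows the UTF-8 byte patterns; otherwise fall back to legacy.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return legacy
    return "utf-8"


plain = b"hello\n"
with_bom = UTF8_BOM + plain

print(guess_encoding(plain))                    # ascii
print(guess_encoding("café".encode("utf-8")))   # utf-8
print(guess_encoding("café".encode("latin-1"))) # iso-8859-1
print(guess_encoding(with_bom))                 # utf-8, no longer ascii
```

The last call shows the BOM problem in miniature: the same six ASCII bytes stop being an ASCII file once the three BOM bytes are prepended, so any tool that expects plain ASCII at the start of the file sees unexpected bytes there. A heuristic like this can still misjudge short legacy-encoded inputs that happen to form valid UTF-8 sequences, so it should be read as a sketch rather than a reliable detector.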
Received on Friday, 12 May 2000 05:31:29 UTC