
RE: BOM & Unicode editors

From: Saba Sundaramurthy <ssundaramurthy@verisign.com>
Date: Tue, 9 May 2000 18:05:54 -0700
Message-ID: <C713C1768C55D3119D200090277AEECA0117DD68@postal.verisign.com>
To: "'Asmus Freytag'" <asmusf@ix.netcom.com>, www-international@w3.org
Hi,

	Thanks for your response.

	Can you point me to more information on the heuristics I can use to
detect a UTF-16 or UTF-8 file? The Microsoft editors I used saved the file
as actual Unicode values (2-byte values). Although I am not familiar with
the UTF-16 encoding, I assume it results in a different sequence than the
actual 2-byte Unicode values. So could you also help me identify pure
Unicode data too?

	Where can I find more information on byte order detection in the
absence of the BOM?

-Saba
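
For the question above about files with no BOM, here is a rough sketch of one common approach. It is not taken from the thread: the function name, the zero-byte test, and the thresholds are all assumptions chosen for illustration. It relies on two facts: for characters in the Basic Multilingual Plane, UTF-16 code units are simply the 2-byte Unicode values, so mostly-Latin text has a zero byte in every pair; and arbitrary non-UTF-8 bytes rarely pass a strict UTF-8 decode.

```python
def guess_encoding_without_bom(data: bytes):
    """Guess 'utf-16-le', 'utf-16-be', 'utf-8', or None for BOM-less data.

    Hypothetical helper; the 0.4 and 0.1 thresholds are arbitrary assumptions.
    """
    if not data:
        return None

    pairs = len(data) // 2
    if len(data) % 2 == 0 and pairs > 0:
        # In UTF-16, ASCII-range characters have a zero high byte.
        # Zeros at odd offsets suggest little-endian, at even offsets big-endian.
        even_zeros = data[0::2].count(0)
        odd_zeros = data[1::2].count(0)
        if odd_zeros > 0.4 * pairs and even_zeros < 0.1 * pairs:
            return "utf-16-le"
        if even_zeros > 0.4 * pairs and odd_zeros < 0.1 * pairs:
            return "utf-16-be"

    # A strict decode rejects byte sequences that are not well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"
```

Note that plain ASCII is reported as UTF-8, since it is valid UTF-8, and that the zero-byte test will miss UTF-16 text with few Latin-range characters (CJK text, for example), so this should be treated as a guess rather than a guarantee.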


> -----Original Message-----
> From: Asmus Freytag [mailto:asmusf@ix.netcom.com]
> Sent: Tuesday, May 09, 2000 5:14 PM
> To: Saba Sundaramurthy; mozilla-i18n@mozilla.org;
> www-international@w3.org; i18n-prog@acoin.com
> Subject: Re: BOM & Unicode editors
> 
> 
> At 04:55 PM 5/9/00 -0700, Saba Sundaramurthy wrote:
> >     Is this something all editors that save files in Unicode or UTF-8
> >are required to do? Can I depend on the presence of this marker in my
> >code?
> 
> No, it's not a requirement, but it's a convention followed by quite a
> few tools, because otherwise it's harder to use the same .txt extension
> for both ASCII and Unicode (and also it helps to mark the byte order,
> of course).
> 
> I would recommend that you look for it in your code, if you plan to
> read UTF-16 files. At the minimum you need to be prepared for its
> presence. But you may possibly encounter some un-marked UTF-16. There
> are some quite strong heuristics that one can follow to detect Unicode
> without a BOM, but a signature like this is more reliable.
> 
> A./
> 
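
The advice in the quoted reply (look for the BOM, but be prepared for its absence) might be sketched as follows. This is only an illustration: `sniff_bom`, `decode_with_bom`, and the UTF-8 fallback are assumed names and choices, not anything prescribed in the thread, and only the UTF-8 and UTF-16 signatures discussed here are checked.

```python
import codecs

# Byte signatures and the codec to use for the bytes that follow them.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the BOM if present; otherwise fall back to `default`.

    The fallback encoding is an assumption; unmarked data may need a
    heuristic or out-of-band information instead.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

The BOM is stripped before decoding so that it does not show up as a stray U+FEFF character at the start of the text.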
Received on Tuesday, 9 May 2000 21:06:31 GMT
