- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 24 Jul 2003 16:46:36 -0400
- To: "Desaulniers, Peter" <Peter.Desaulniers@pahv.xerox.com> (by way of Martin Duerst <duerst@w3.org>), www-international@w3.org
Hello Peter,

I forwarded this mail as the moderator, and my mailer garbled your characters. But I'll try to explain your mail and answer your question.

At 16:38 03/07/24 -0400, Desaulniers, Peter wrote:
>Dear all,
>
>I am just trying to understand the fundamentals of inputing and output
>characters to files or other byte streams.
>
>I tried an experiment which I can not explain. Please read the following
>and see if you can offer an explanation.
>
>Using Microsoft Notepad...
>
>I created a text file with the character: $Bqî(B

That was supposed to be e-acute before my mailer mangled it.

>I store it as ANSI, the file contains the byte: E9 (as viewed by a binary
>editor)
>
>I store it again as UTF8, that file contains the bytes: C3 A9

If you use Notepad, the file will also contain what is called a UTF-8 BOM (byte order mark), which is the three bytes EF BB BF at the start of the file. The overall size of the file will therefore be 5 bytes, not 2. Please check with the 'dir' command.

>Then I open the ANSI file and I see $Br"(B (decodes E9 as ANSI)

That still is e-acute.

>Then I open the UTF8 file and I see $Br"(B (decodes C3 A9 as UTF8). Why do I
>not see the ANSI characters: $B%F%%(B?

Because Notepad uses the BOM to identify this file as UTF-8.

>How can opening two files with the same application with different bytes be
>decoded into the same character?

Notepad uses the BOM. It is also possible (although somewhat risky, especially on very small files such as this one) to do the following:

- Check if the file looks like UTF-8
- If the file looks like UTF-8, decode as UTF-8
- Else, decode with the legacy encoding of your OS.

>If its the appropriate protocol for this forum, please reply directly to my
>email address since I do not receive mail from the www-international@w3.org
>mailings.

This is somewhat marginal, but the BOM in UTF-8 can cause a problem on the Web because not all browsers support it. It is better to remove the BOM before you actually publish a file.

Regards, Martin.
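
The byte values and the detection steps discussed above can be tried out with a short Python sketch. It is only an illustration under assumptions: "ANSI" is taken to mean Windows code page 1252 (the usual case on a Western European or US system), the function name decode_text is made up for this example, and nothing here claims to be how Notepad is actually implemented.

```python
import codecs

def decode_text(data: bytes, legacy: str = "cp1252") -> str:
    """Guess the encoding of a small text file (hypothetical helper).

    `legacy` stands in for the OS legacy ("ANSI") encoding; cp1252 is
    an assumption, not something the original mail specifies.
    """
    # 1. A UTF-8 BOM (bytes EF BB BF) marks the data as UTF-8; strip it.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # 2. No BOM: check whether the bytes are well-formed UTF-8.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Otherwise fall back to the legacy encoding.
        return data.decode(legacy)

e_acute = "\u00e9"                           # e-acute

ansi_bytes = e_acute.encode("cp1252")        # 1 byte:  E9
utf8_bytes = e_acute.encode("utf-8")         # 2 bytes: C3 A9
notepad_bytes = e_acute.encode("utf-8-sig")  # 5 bytes: EF BB BF C3 A9

# All three byte sequences decode to the same character.
for raw in (ansi_bytes, utf8_bytes, notepad_bytes):
    print(raw.hex(" "), "->", decode_text(raw))

# Reading the UTF-8 bytes as "ANSI" gives two characters instead of one.
print(repr(utf8_bytes.decode("cp1252")))
```

The fallback in step 3 is where the risk lies: a short legacy-encoded file can happen to be valid UTF-8 as well, and would then be decoded the wrong way without any error.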
Received on Thursday, 24 July 2003 16:46:45 UTC