
Re: Character encoding interpretation by a text editor

From: Martin Duerst <duerst@w3.org>
Date: Thu, 24 Jul 2003 16:46:36 -0400
Message-Id: <4.2.0.58.J.20030724164117.05765018@localhost>
To: "Desaulniers, Peter" <Peter.Desaulniers@pahv.xerox.com>(by way of Martin Duerst <duerst@w3.org>), www-international@w3.org

Hello Peter,

I forwarded this mail as the moderator, and my mailer garbled your
characters. But I'll try to explain your mail and answer your
question.

At 16:38 03/07/24 -0400, Desaulniers, Peter wrote:




>Dear all,
>
>I am just trying to understand the fundamentals of inputting and outputting
>characters to files or other byte streams.
>
>I tried an experiment which I cannot explain.  Please read the following
>and see if you can offer an explanation.
>
>Using Microsoft Notepad...
>
>I created a text file with the character:  $Bq(B

That was supposed to be e-acute before my mailer mangled it.

>I store it as ANSI, the file contains the byte: E9  (as viewed by a binary
>editor)
>
>I store it again as UTF8, that file contains the bytes: C3 A9

If you use Notepad, the file will also contain what is called
a UTF-8 BOM (byte order mark): the three bytes EF BB BF at the
start of the file. The overall size of the file will therefore
be 5 bytes, not 2. Please check with the 'dir' command.
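
The byte counts can be reproduced outside Notepad. The following is
only a sketch in Python; it assumes that "ANSI" on your system means
Windows-1252 (Python's "cp1252") and that Python's "utf-8-sig" codec
matches what Notepad writes when saving as UTF-8.

```python
text = "\u00e9"  # e-acute

# "ANSI" is assumed here to be Windows-1252.
ansi_bytes = text.encode("cp1252")
# Plain UTF-8, without any signature.
utf8_bytes = text.encode("utf-8")
# UTF-8 preceded by the BOM (EF BB BF).
bom_bytes = text.encode("utf-8-sig")

for label, data in [("ANSI", ansi_bytes),
                    ("UTF-8", utf8_bytes),
                    ("UTF-8 with BOM", bom_bytes)]:
    print(label, data.hex(" "), len(data), "byte(s)")
```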


>Then I open the ANSI file and I see $Br"(B (decodes E9 as ANSI)

That is still e-acute.


>Then I open the UTF8 file and I see $Br"(B (decodes C3 A9 as UTF8).   Why do I
>not see the ANSI characters: $B%F%%(B?

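Because Notepad uses the BOM to identify this file as UTF-8.
(I assume the garbled characters at the end of your question were
A-tilde followed by the copyright sign, which is how the bytes
C3 A9 appear when they are decoded as ANSI.)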


>How can opening two files with the same application with different bytes be
>decoded into the same character?

Notepad uses the BOM. It is also possible (although somewhat risky,
especially on very small files such as this one) to do the following:
- Check if the file looks like UTF-8
- If the file looks like UTF-8, decode as UTF-8
- Else, decode with the legacy encoding of your OS.
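
The steps above can be sketched in Python as follows. This is an
illustration only, not what Notepad actually does: the function name
decode_text and the default legacy encoding "cp1252" are assumptions,
and "looks like UTF-8" is reduced to "decodes as UTF-8 without error".

```python
import codecs


def decode_text(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Guess the encoding of data and return the decoded text."""
    # A UTF-8 BOM identifies the file directly; drop it and decode.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    try:
        # No BOM: treat the file as UTF-8 if it is well-formed UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the legacy encoding (assumed cp1252).
        # Bytes undefined in that encoding become U+FFFD.
        return data.decode(legacy_encoding, errors="replace")


# All three files from the experiment decode to e-acute.
for data in (b"\xe9", b"\xc3\xa9", b"\xef\xbb\xbf\xc3\xa9"):
    print(data.hex(" "), "->", ascii(decode_text(data)))
```

The risk shows up on exactly this kind of small file: an ANSI file
that really contains A-tilde followed by the copyright sign is also
the bytes C3 A9, so this guess would turn it into e-acute.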


>If it's the appropriate protocol for this forum, please reply directly to my
>email address since I do not receive mail from the www-international@w3.org
>mailings.

This is somewhat marginal, but the BOM in UTF-8 can cause a problem
on the Web because not all browsers support it. Better remove the BOM
before you actually publish a file.
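
Removing the BOM amounts to dropping the first three bytes when they
are EF BB BF. A minimal sketch in Python, where the function name
strip_utf8_bom is made up for the illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # No BOM at the start: leave the bytes untouched.
    return data


print(strip_utf8_bom(b"\xef\xbb\xbf\xc3\xa9").hex(" "))
```

To apply it to a file, read the file in binary mode, pass the bytes
through the function, and write the result back in binary mode.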


Regards,    Martin.
Received on Thursday, 24 July 2003 16:46:45 GMT
