
[Bug 11973] HTML Spec confuses character sets with character encodings

From: <bugzilla@jessica.w3.org>
Date: Thu, 03 Feb 2011 20:16:27 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1Pl5bP-0000tR-Mf@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11973

--- Comment #3 from Craig S <craig.e.shea@gmail.com> 2011-02-03 20:16:26 UTC ---
I did a lot of reading, and this is the best explanation I can offer.

I took a Word document and used File->Save As to save it as HTML (filtered),
which strips all the MS-specific XML namespace markup and sticks to
traditional HTML.

As we all know, Word replaces the standard apostrophe and double quotes with
"curly" versions. When I looked at the hexadecimal value of a right single
quote as stored in the document, it was the three bytes 0xE2, 0x80, 0x99
(which the dump viewer shows as a lower-case a with a circumflex, a Euro
currency symbol, and the trademark symbol). These three bytes are the UTF-8
encoding of a right single quote (U+2019). However, in my web browser (IE9
beta), it shows up as a '?'.
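
To make the byte values above concrete, here is a minimal Python sketch that
decodes the same three bytes two ways. It assumes the dump viewer interprets
bytes as windows-1252 (Python's "cp1252" codec); the helper name `names` is
made up for illustration.

```python
import unicodedata

# The three bytes found in the saved document.
raw = bytes([0xE2, 0x80, 0x99])

# Read as UTF-8, the three bytes form a single character.
as_utf8 = raw.decode("utf-8")

# Read as windows-1252, each byte is its own character.
as_cp1252 = raw.decode("cp1252")


def names(text):
    """Return the Unicode name of every character in text (hypothetical helper)."""
    return [unicodedata.name(ch) for ch in text]


print(names(as_utf8))
print(names(as_cp1252))
```

The first list should hold one name (the right single quotation mark) and the
second should hold three, matching what the dump viewer showed.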

If I specify a META element with http-equiv=Content-Type and
content="text/html; charset=windows-1252", the page is displayed correctly,
with the correct character, even though that character is still encoded with
the 3 bytes shown above.

Perhaps I misunderstood the problem. However, from what I can see, Word uses
the windows-1252 character set, and when I send charset=windows-1252 to the
UA, the page displays correctly. As far as I know, windows-1252 does not
necessarily need to be encoded in UTF-8. It could just as easily use ASCII
encoding.
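
As a point of comparison for the encodings named above, the following Python
sketch encodes a right single quote (U+2019) under each of them. It is only an
illustration using Python's built-in codecs, where windows-1252 is available
as "cp1252"; the names `try_encode` and `results` are made up for this
example, and nothing here reflects what Word or IE9 does internally.

```python
RIGHT_SINGLE_QUOTE = "\u2019"


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


# Encode the same character under each encoding mentioned in this comment.
results = {
    enc: try_encode(RIGHT_SINGLE_QUOTE, enc)
    for enc in ("utf-8", "cp1252", "ascii")
}

for enc, data in results.items():
    print(enc, data.hex() if data is not None else "not representable")
```

With Python's codecs, UTF-8 gives the three bytes e2 80 99, windows-1252 gives
the single byte 92, and ASCII has no representation for this character.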

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Thursday, 3 February 2011 20:16:29 GMT
