- From: <bugzilla@jessica.w3.org>
- Date: Thu, 03 Feb 2011 19:51:58 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11973

           Summary: HTML Spec confuses character sets with character encodings
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: craig.e.shea@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org

The relevant part of the specification is as follows: "The charset attribute specifies the character encoding used by the document."

This has been a problem with the HTML specification since...well, since at least HTML 3.2, but it has been more of a problem since the late '90s, with HTML 4.01, the formulation of XML 1.0, and the rising use of Unicode for information interchange.

What I find confusing is the specification's mixing up of the terms character set and character encoding. At one point, the spec is talking about the character set of the document, while at another it is clearly talking about the character encoding of the document, though using the misnomer attribute name @charset.

I recently worked on a project grabbing text from MS Word, storing it in a database, and retrieving that text to display on a web page. As you might have already guessed, I got "funky" characters in the output (usually ?'s, some /'s and a few boxes). This is to be expected, unfortunately. The problem is that the text was saved in the UTF-8 character encoding, and so the web page was sent with the following Content-Type: "text/html; charset=UTF-8". However, the document is using the windows-1252 character set. Let me rephrase this: the text is encoded with UTF-8 using the windows-1252 character set (which is what MS Word uses). Now, if I change the http-equiv=Content-Type to the following: "text/html; charset=Windows-1252", then the document displays correctly.

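The symptom described above can be reproduced with a minimal Python sketch. It rests on an assumption the report does not confirm: that the bytes reaching the browser were windows-1252 bytes (Python codec name "cp1252"). The sample text and variable names are hypothetical.

```python
# Hypothetical sample text; curly quotes and an en dash are typical of MS Word.
text = "\u201cfunky\u201d characters \u2013 like these"

# Assumption: the stored bytes are windows-1252, not UTF-8.
stored = text.encode("cp1252")

# Served as charset=UTF-8: bytes 0x93, 0x94 and 0x96 are not valid UTF-8
# sequences, so each one is replaced with U+FFFD (often drawn as ? or a box).
as_utf8 = stored.decode("utf-8", errors="replace")

# Served as charset=windows-1252: the same bytes decode to the original text.
as_cp1252 = stored.decode("cp1252")

print(as_utf8)
print(as_cp1252)
```

Only the label changes between the two decode calls; the bytes are the same. Under this assumption, the label that matches the bytes is the one that displays correctly.
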
Therefore, even though the spec clearly says that "charset...specifies the character encoding used by the document", it should instead read "charset...specifies the character set used by the document."

However, I recognize that it is equally important for a UA to know how the document is encoded, as has been discussed with the potential security implications of UTF-7 over UTF-8, for example. Therefore, I propose that the specification also include an @encoding attribute, perhaps on the META element, much as XML 1.0 has an @encoding attribute. In this way, a UA can unambiguously determine both the encoding used to store the document as bytes, and the set of characters those bytes encode, i.e. the character set. Furthermore, a META element with the @encoding attribute should be mandatory since it is impossible to differentiate, for example, a document that has been stored in UTF-8 vs. ANSI.

With these two pieces of information, a UA now knows how to decode the bytes of a document and which characters those bytes encode.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
Received on Thursday, 3 February 2011 19:52:04 UTC