[Bug 11973] New: HTML Spec confuses character sets with character encodings from bugzilla@jessica.w3.org on 2011-02-03 (public-html@w3.org from February 2011)

From: <bugzilla@jessica.w3.org>
Date: Thu, 03 Feb 2011 19:51:58 +0000
To: public-html@w3.org
Message-ID: <bug-11973-2495@http.www.w3.org/Bugs/Public/>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11973

           Summary: HTML Spec confuses character sets with character
                    encodings
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: craig.e.shea@gmail.com
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


The relevant part of the specification is as follows: "The charset attribute
specifies the character encoding used by the document."

This has been a problem with the HTML specification since...well, since at
least HTML 3.2, and it has become more of one since the late '90s with HTML
4.01, the formulation of XML 1.0, and the rising use of Unicode for
information interchange.

What I find confusing is that the specification mixes up the terms character
set and character encoding. At one point the spec talks about the character
set of the document, while at another it is clearly talking about the
character encoding of the document, though under the misnamed attribute
@charset.
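
As an illustration of the two terms (not taken from the spec), here is a
minimal Python sketch. It shows a single character that has one code point in
the Unicode character set but different byte sequences under different
character encodings. The sample character and the variable names are arbitrary
choices made for this example.

```python
# One character, one Unicode code point, several byte representations.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (arbitrary sample)

code_point = ord(ch)                 # position in the Unicode character set
as_utf8 = ch.encode("utf-8")         # two bytes under UTF-8
as_cp1252 = ch.encode("cp1252")      # one byte under windows-1252
as_utf16le = ch.encode("utf-16-le")  # two bytes, but not the UTF-8 ones

print(hex(code_point), as_utf8, as_cp1252, as_utf16le)
```

The character is the same in every case; only the bytes used to store it
differ.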

I recently worked on a project grabbing text from MS Word, storing it in a
database, and retrieving that text to display on a web page. As you might have
already guessed, I got "funky" characters in the output (usually ?'s, some /'s
and a few boxes). This is to be expected, unfortunately.

The problem is that the text was saved in the UTF-8 character encoding, and so
the web page was sent with the following Content-Type: "text/html;
charset=UTF-8". However, the document is using the windows-1252 character set.
Let me rephrase this: the text is encoded with UTF-8 using the windows-1252
character set (which is what MS Word uses).

Now, if I change the http-equiv=Content-Type to the following: "text/html;
charset=Windows-1252", then the document displays correctly. Therefore, even
though the spec clearly says that "charset...specifies the character encoding
used by the document", it should instead read "charset...specifies the
character set used by the document."
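
One way to reproduce the symptom described above is sketched below in Python.
It rests on an assumption that is not confirmed by the report: that the bytes
reaching the browser were windows-1252 bytes (for example, curly quotes pasted
from Word) while the page was labeled UTF-8. The sample text and variable names
are hypothetical.

```python
# Assumption: the stored bytes are windows-1252, e.g. Word's curly quotes.
text = "\u201cfunky\u201d"       # curly-quoted sample text (hypothetical)
stored = text.encode("cp1252")   # bytes 0x93 ... 0x94 around the word

# Decoding with a label that does not match the bytes: 0x93 and 0x94 are not
# valid UTF-8 on their own, so each becomes U+FFFD (the replacement character).
as_utf8 = stored.decode("utf-8", errors="replace")

# Decoding with the label that matches the bytes recovers the original text.
as_cp1252 = stored.decode("cp1252")

print(ascii(as_utf8))
print(ascii(as_cp1252))
```

Under that assumption, switching the label to windows-1252 fixes the display
because the label then matches the bytes that were actually stored.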

However, I recognize that it is equally important for a UA to know how the
document is encoded, as the discussions of the potential security implications
of UTF-7 versus UTF-8 have shown. Therefore, I propose that the specification
also include an @encoding attribute, perhaps on the META element, much as the
XML 1.0 declaration has an encoding pseudo-attribute. In this way, a UA can
unambiguously determine both the encoding used to store the document as bytes
and the set of characters those bytes encode, i.e. the character set.
Furthermore, a META element with the @encoding attribute should be mandatory,
since it is otherwise impossible to reliably differentiate, for example, a
document that has been stored in UTF-8 from one stored in ANSI.
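
The ambiguity mentioned above can be sketched in Python: the same bytes can be
valid under more than one encoding yet yield different text, so the bytes alone
do not say which label is right. Here "ANSI" is taken to mean windows-1252,
which is an assumption, and the sample bytes are arbitrary.

```python
data = b"caf\xc3\xa9"              # arbitrary sample bytes

as_utf8 = data.decode("utf-8")     # four characters, ending in U+00E9
as_cp1252 = data.decode("cp1252")  # five characters, ending in U+00C3 U+00A9

# Both decodes succeed without error, but they disagree about the text.
print(ascii(as_utf8), ascii(as_cp1252), as_utf8 == as_cp1252)
```

Neither decode raises an error, so a UA without a declared label has to guess.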

With these two pieces of information, a UA now knows how to decode the bytes of
a document and which characters those bytes encode.

Received on Thursday, 3 February 2011 19:52:00 UTC