[Bug 11973] HTML Spec confuses character sets with character encodings

http://www.w3.org/Bugs/Public/show_bug.cgi?id=11973

--- Comment #6 from Craig S <craig.e.shea@gmail.com> 2011-02-03 21:07:50 UTC ---
(In reply to comment #5)
> As far as I can tell, the spec is not confused; it's just as Julian says that
> some attributes/params have unfortunate names for legacy reasons.

I agree. However, I still maintain that the spec should say "The charset
attribute specifies the character set used by the document.", as this seems
to be how UAs in fact treat it. This would at least make the spec's
"definition" align with the attribute name. In addition, perhaps the spec
could mandate that all HTML files be encoded (read: stored or saved) as
UTF-8. Then, with the combination of the mandated encoding and the declared
character set, a UA knows how to interpret the document. It also
preserves 99.999% of all web pages in the wild (since plain ASCII is
already valid UTF-8). Furthermore, the spec can continue to say that the
default character set for HTML is UTF-8 (and, should you want anything
different, be sure to specify it with the META tag using one of the specified
methods).
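
The point that ASCII text is already valid UTF-8 can be checked directly.
A minimal Python sketch; the sample string is a made-up illustration, not
taken from any real page:

```python
# Hypothetical sample markup containing only ASCII characters.
sample = "<p>Hello, world!</p>"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")

# Code points 0-127 are stored as the same single bytes in both encodings,
# so an ASCII-only file is byte-for-byte a valid UTF-8 file.
same_bytes = ascii_bytes == utf8_bytes
round_trip = ascii_bytes.decode("utf-8") == sample
print(same_bytes, round_trip)
```

This only covers pure ASCII content; any byte above 0x7F in a legacy page
needs separate handling.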

I found a snippet of text on stackoverflow.com
(http://stackoverflow.com/questions/2014069/windows-1252-to-utf-8-encoding)
that was interesting:

"While utf8 is valid Win-1252, the reverse is not true: win-1252 is NOT valid
UTF-8."

This explains why I see "funky" characters in my HTML page when it is sent as
charset=UTF-8 rather than charset=windows-1252 (which displays correctly).
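
A rough Python sketch of what is probably happening, assuming the page is
stored as windows-1252 but labelled UTF-8. The sample word is hypothetical,
and a real UA's error handling may differ from Python's:

```python
text = "caf\u00e9"  # hypothetical sample ("cafe" with e-acute), not from my page

cp1252_bytes = text.encode("windows-1252")  # e-acute is the single byte 0xE9
utf8_bytes = text.encode("utf-8")           # e-acute is the two bytes 0xC3 0xA9

# A lone 0xE9 byte opens a multi-byte UTF-8 sequence that never completes,
# so a strict UTF-8 decode of the windows-1252 bytes fails.
try:
    cp1252_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes U+FFFD, the "funky" replacement character.
lenient = cp1252_bytes.decode("utf-8", errors="replace")

# The opposite mislabelling decodes without error but shows two wrong
# characters in place of the one intended.
mojibake = utf8_bytes.decode("windows-1252")

print(strict_ok, repr(lenient), repr(mojibake))
```

Labelling the same bytes as windows-1252 decodes them cleanly, which matches
what I see in the browser.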

Thank you all for your comments.

Received on Thursday, 3 February 2011 21:07:52 UTC