- From: <bugzilla@jessica.w3.org>
- Date: Fri, 30 Apr 2010 14:49:49 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628
Summary: "willful violation" for detecting the charset
Product: HTML WG
Version: unspecified
Platform: PC
OS/Version: Windows NT
Status: NEW
Severity: normal
Priority: P2
Component: HTML5 spec bugs
AssignedTo: dave.null@w3.org
ReportedBy: julian.reschke@gmx.de
QAContact: public-html-bugzilla@w3.org
CC: ian@hixie.ch, mike@w3.org, public-html@w3.org
From http://dev.w3.org/html5/spec/infrastructure.html#content-type-sniffing:
"The algorithm for extracting an encoding from a Content-Type, given a string
s, is as follows. It either returns an encoding or nothing.

1. Find the first seven characters in s that are an ASCII case-insensitive
   match for the word "charset". If no such match is found, return nothing.

2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that
   immediately follow the word "charset" (there might not be any).

3. If the next character is not a U+003D EQUALS SIGN ('='), return nothing
   and abort these steps.

4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that
   immediately follow the equals sign (there might not be any).

5. Process the next character as follows:

   If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022
   QUOTATION MARK ('"') in s
   If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE
   ("'") in s
      Return the encoding corresponding to the string between this
      character and the next earliest occurrence of this character.

   If it is an unmatched U+0022 QUOTATION MARK ('"')
   If it is an unmatched U+0027 APOSTROPHE ("'")
   If there is no next character
      Return nothing.

   Otherwise
      Return the encoding corresponding to the string from this character
      to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B
      character or the end of s, whichever comes first.

Note: This requirement is a willful violation of the HTTP specification,
motivated by the need for backwards compatibility with legacy content. [HTTP]"
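
For reference, the quoted steps can be sketched in Python roughly as follows. This is only an illustration of the text above, not code taken from the spec or from any browser. The function name extract_charset_label is made up for this example, and it returns the raw label string instead of mapping it to an encoding, since the quoted text does not say how that mapping is done.

```python
import re
from typing import Optional

# U+0009, U+000A, U+000C, U+000D, U+0020, as listed in steps 2 and 4.
_WS = "\t\n\x0c\r "
# re.ASCII keeps the case-insensitive match restricted to ASCII letters.
_CHARSET = re.compile("charset", re.IGNORECASE | re.ASCII)


def extract_charset_label(s: str) -> Optional[str]:
    """Hypothetical helper: return the charset label found in s, or None."""
    # Step 1: first ASCII case-insensitive match for "charset".
    match = _CHARSET.search(s)
    if match is None:
        return None
    i = match.end()

    # Step 2: skip whitespace after "charset".
    while i < len(s) and s[i] in _WS:
        i += 1

    # Step 3: the next character must be '='.
    if i >= len(s) or s[i] != "=":
        return None
    i += 1

    # Step 4: skip whitespace after '='.
    while i < len(s) and s[i] in _WS:
        i += 1

    # Step 5: no next character means "return nothing".
    if i >= len(s):
        return None

    ch = s[i]
    if ch in "\"'":
        # Quoted value: take everything up to the next identical quote.
        # An unmatched quote returns nothing. Backslashes get no special
        # treatment here, exactly as in the quoted text.
        end = s.find(ch, i + 1)
        return None if end == -1 else s[i + 1:end]

    # Unquoted value: runs to the first whitespace, ';', or end of s.
    j = i
    while j < len(s) and s[j] not in _WS + ";":
        j += 1
    return s[i:j]
```

With this sketch, both charset="utf-8" and charset='utf-8' yield utf-8, which is the single-quote behaviour questioned in point (3) below.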
General problems:
(1) This algorithm doesn't seem to be used.
(2) It's VERY unfriendly to the reader to claim that there's a violation of the
HTTP spec without saying what it is.
Specific problems:
(3) The algorithm requires allowing single quotes; this is indeed a violation
of the HTTP syntax. I just checked with IE8; it doesn't allow single quotes.
Thus, the claim "needed for backwards compatibility" appears to be incorrect.
(4) The spec also violates HTTP in that the backslash character inside quoted
values isn't treated as an escape character (HTTP's quoted-pair). If this is
needed "for compatibility", this should be backed up with data.
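
To make points (3) and (4) concrete, here is a small Python sketch contrasting the two readings of a quoted parameter value. Both function names are invented for this illustration. between_quotes mimics the "string between this character and the next earliest occurrence" rule from the quoted algorithm; http_quoted_string follows my reading of the RFC 2616 quoted-string grammar (double quotes only, with a backslash escaping the following character). Neither is meant as a complete header parser.

```python
def between_quotes(value: str) -> str:
    """Quoted algorithm's rule: text up to the next identical quote character."""
    quote = value[0]
    return value[1:value.index(quote, 1)]


def http_quoted_string(value: str) -> str:
    """RFC 2616 style quoted-string: '"' delimiters, backslash = quoted-pair."""
    if not value.startswith('"'):
        raise ValueError("not a quoted-string")
    out = []
    i = 1
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # quoted-pair: the backslash escapes the following character.
            out.append(value[i + 1])
            i += 2
        elif ch == '"':
            # Unescaped closing quote ends the value.
            return "".join(out)
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated quoted-string")


sample = '"utf\\"8"'  # the eight characters: "utf\"8"
print(between_quotes(sample))      # stops at the escaped quote
print(http_quoted_string(sample))  # treats \" as a literal quote
```

For the sample input the first function yields utf\ and the second yields utf"8. A value wrapped in apostrophes is accepted by between_quotes but is not a quoted-string at all under the HTTP grammar, so http_quoted_string rejects it.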
Received on Friday, 30 April 2010 14:49:50 UTC