- From: <bugzilla@jessica.w3.org>
- Date: Fri, 30 Apr 2010 14:49:49 +0000
- To: public-html-bugzilla@w3.org
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628

           Summary: "willful violation" for detecting the charset
           Product: HTML WG
           Version: unspecified
          Platform: PC
        OS/Version: Windows NT
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HTML5 spec bugs
        AssignedTo: dave.null@w3.org
        ReportedBy: julian.reschke@gmx.de
         QAContact: public-html-bugzilla@w3.org
                CC: ian@hixie.ch, mike@w3.org, public-html@w3.org

From http://dev.w3.org/html5/spec/infrastructure.html#content-type-sniffing:

"The algorithm for extracting an encoding from a Content-Type, given a string s, is as follows. It either returns an encoding or nothing.

   1. Find the first seven characters in s that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing.

   2. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the word "charset" (there might not be any).

   3. If the next character is not a U+003D EQUALS SIGN ('='), return nothing and abort these steps.

   4. Skip any U+0009, U+000A, U+000C, U+000D, or U+0020 characters that immediately follow the equals sign (there might not be any).

   5. Process the next character as follows:

      If it is a U+0022 QUOTATION MARK ('"') and there is a later U+0022 QUOTATION MARK ('"') in s
      If it is a U+0027 APOSTROPHE ("'") and there is a later U+0027 APOSTROPHE ("'") in s
          Return the encoding corresponding to the string between this character and the next earliest occurrence of this character.

      If it is an unmatched U+0022 QUOTATION MARK ('"')
      If it is an unmatched U+0027 APOSTROPHE ("'")
      If there is no next character
          Return nothing.

      Otherwise
          Return the encoding corresponding to the string from this character to the first U+0009, U+000A, U+000C, U+000D, U+0020, or U+003B character or the end of s, whichever comes first.

Note: This requirement is a willful violation of the HTTP specification, motivated by the need for backwards compatibility with legacy content.
[HTTP]"

General problems:

(1) This algorithm doesn't seem to be used.

(2) It's VERY unfriendly to the reader to claim that there's a violation of the HTTP spec without saying what it is.

Specific problems:

(3) The algorithm requires allowing single quotes; this is indeed a violation of the HTTP syntax. I just checked with IE8; it doesn't allow single quotes. Thus, the claim "needed for backwards compatibility" appears to be incorrect.

(4) The spec also violates HTTP in that the backslash character inside quoted values isn't treated properly. If this is needed "for compatibility", this should be backed up with data.

-- 
Configure bugmail: http://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the QA contact for the bug.
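
For reference, the quoted extraction steps can be sketched in Python roughly as follows. This is only an illustration of the quoted text, not code from the spec or from any browser: the function name extract_charset_label is made up here, and the sketch returns the raw label string instead of mapping it to an encoding, since the quoted passage does not say how that mapping works. It makes points (3) and (4) easy to reproduce: single quotes are accepted as delimiters, and a backslash inside a quoted value is not treated as an escape.

```python
import re
from typing import Optional

# Characters skipped in steps 2 and 4: U+0009, U+000A, U+000C, U+000D, U+0020.
_WHITESPACE = "\t\n\x0c\r "


def extract_charset_label(s: str) -> Optional[str]:
    """Hypothetical helper following the quoted steps; returns the raw label or None."""
    # Step 1: first ASCII case-insensitive match for "charset".
    # re.ASCII keeps IGNORECASE from matching non-ASCII look-alikes.
    match = re.search("charset", s, re.IGNORECASE | re.ASCII)
    if match is None:
        return None
    i = match.end()

    # Step 2: skip whitespace after the word "charset".
    while i < len(s) and s[i] in _WHITESPACE:
        i += 1

    # Step 3: the next character must be "=".
    if i >= len(s) or s[i] != "=":
        return None
    i += 1

    # Step 4: skip whitespace after the equals sign.
    while i < len(s) and s[i] in _WHITESPACE:
        i += 1

    # Step 5: no next character means "return nothing".
    if i >= len(s):
        return None

    quote = s[i]
    if quote in ('"', "'"):
        # Quoted value: take everything up to the next identical quote.
        # A backslash is NOT treated as an escape here (point 4 of the report).
        end = s.find(quote, i + 1)
        if end == -1:
            return None  # unmatched quote
        return s[i + 1:end]

    # Unquoted value: runs to the first whitespace, ";" or end of string.
    j = i
    while j < len(s) and s[j] not in _WHITESPACE + ";":
        j += 1
    return s[i:j]
```

With this sketch, "text/html; charset='utf-8'" yields utf-8, which is the single-quote behaviour questioned in (3), and an input whose quoted value contains a backslash followed by a double quote is cut at that quote, which is the behaviour questioned in (4).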
Received on Friday, 30 April 2010 14:49:50 UTC