- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Sat, 29 Jan 2011 19:47:01 +0100
- To: Sam Ruby <rubys@intertwingly.net>
- CC: HTML WG <public-html@w3.org>
On 05.01.2011 19:54, Sam Ruby wrote:
> 'Algorithm for detecting the charset="" parameter'
>
> Per the decision policy, at this time the chairs would like to solicit
> volunteers to write Change Proposals.
>
> http://www.w3.org/html/wg/tracker/issues/148
> http://dev.w3.org/html5/decision-policy/decision-policy.html#escalation
>
> If no Change Proposals are written by February 6th, 2011 this issue
> will be closed without prejudice.
>
> Issue status link:
> http://dev.w3.org/html5/status/issue-status.html#ISSUE-148
>
> - Sam Ruby

Below is a Change Proposal for ISSUE-148.

-- snip --

SUMMARY

The "algorithm for extracting an encoding from a Content-Type" [1] handles certain field values incorrectly.

RATIONALE

The "algorithm for extracting an encoding from a Content-Type" [1] tries to define a shortcut for parsing Content-Type header field values (defined in RFC 2616, 3.7, [2]).

Trying shortcuts is dangerous; you may treat edge cases incorrectly.

Furthermore, the spec claims that the shortcut is needed for "backwards compatibility with legacy content", but we do have evidence that existing UAs disagree (see [3]).

DETAILS

The following field value

  text/plain; foocharset=UTF-8

is treated incorrectly, because the algorithm "sees" a charset parameter when there isn't one.

Minimally, the algorithm needs to be modified so that it checks for delimiters before the string "charset":

Change:

  "2. Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset". If no such match is found, return nothing and abort these steps."

to

  "2. Loop: Find the first seven characters in s after position that are an ASCII case-insensitive match for the word "charset" and follow a delimiter character. If no such match is found, return nothing and abort these steps."

Define "delimiter character" suitably, such as "control characters, whitespace, and U+003B SEMICOLON character (;)".
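As a rough illustration of the modified step 2, here is a minimal Python sketch. It is not the spec's algorithm: it covers only the search for "charset", the names is_delimiter and find_charset_param are hypothetical, and the delimiter set is simply the one suggested above. It also assumes that a match at the very start of the string does not count, since nothing precedes it.

```python
def is_delimiter(ch: str) -> bool:
    # Assumed delimiter set, taken from the proposal's suggestion:
    # control characters, whitespace, and U+003B SEMICOLON (;).
    return ch == ";" or ch.isspace() or ord(ch) < 0x20 or ord(ch) == 0x7F


def ascii_lower(s: str) -> str:
    # Lowercase only A-Z so the result has the same length as the input
    # and the comparison is ASCII case-insensitive.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def find_charset_param(s: str, position: int = 0) -> int:
    """Return the index of the first "charset" at or after `position`
    that directly follows a delimiter character, or -1 if there is none."""
    lowered = ascii_lower(s)
    while True:
        i = lowered.find("charset", position)
        if i == -1:
            return -1  # no match: "return nothing"
        if i > 0 and is_delimiter(s[i - 1]):
            return i
        # Matched inside a longer token such as "foocharset"; keep looking.
        position = i + 1


print(find_charset_param("text/plain; foocharset=UTF-8"))  # -1
print(find_charset_param("text/plain; charset=UTF-8"))     # 12
```

With this check, the "foocharset" example above is no longer mistaken for a charset parameter, while a real charset parameter appearing later in the same value is still found.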
Alternatively, drop the whole algorithm, and just state that the value should be parsed according to the media-type grammar in [2]; see ISSUE-125 and ISSUE-126.

IMPACT

1. Positive Effects

Fewer incorrect matches when parsing the value.

2. Negative Effects

None.

3. Conformance Classes Changes

Not sure. Does this section describe conformance requirements?

4. Risks

None.

REFERENCES

[1] <http://dev.w3.org/html5/spec/Overview.html#content-type-sniffing>
[2] <http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.7>
[3] <http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628#c3>
Received on Saturday, 29 January 2011 19:00:46 UTC