Change Proposal for ISSUE-148, was: ISSUE-148 (charset-detect): Chairs Solicit Proposals from Julian Reschke on 2011-01-29 (public-html@w3.org from January 2011)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sat, 29 Jan 2011 19:47:01 +0100
To: Sam Ruby <rubys@intertwingly.net>
CC: HTML WG <public-html@w3.org>
Message-ID: <4D4460A5.5000309@gmx.de>

On 05.01.2011 19:54, Sam Ruby wrote:
> 'Algorithm for detecting the charset="" parameter'
>
> Per the decision policy, at this time the chairs would like to solicit
> volunteers to write Change Proposals.
>
> http://www.w3.org/html/wg/tracker/issues/148
> http://dev.w3.org/html5/decision-policy/decision-policy.html#escalation
>
> If no Change Proposals are written by February 6th, 2011 this issue
> will be closed without prejudice.
>
> Issue status link:
> http://dev.w3.org/html5/status/issue-status.html#ISSUE-148
>
> - Sam Ruby

Below is a Change Proposal for ISSUE-148.

-- snip --
SUMMARY

The "algorithm for extracting an encoding from a Content-Type" [1] 
handles certain field values incorrectly.

RATIONALE

The "algorithm for extracting an encoding from a Content-Type" [1] tries 
to define a shortcut for parsing Content-Type header field values 
(defined in RFC 2616, 3.7, [2]). Shortcuts like this are dangerous, 
because they tend to treat edge cases incorrectly.

Furthermore, the spec claims that the shortcut is needed for "backwards 
compatibility with legacy content", but we do have evidence that 
existing UAs disagree (see [3]).

DETAILS

The following field value

   text/plain; foocharset=UTF-8

is treated incorrectly: the algorithm matches the substring "charset" 
inside the parameter name "foocharset" and thus "sees" a charset 
parameter when there isn't one.

Minimally, the algorithm needs to be modified so that it checks for 
delimiters before the string "charset":

Change:

"2. Loop: Find the first seven characters in s after position that are 
an ASCII case-insensitive match for the word "charset". If no such match 
is found, return nothing and abort these steps."

to

"2. Loop: Find the first seven characters in s after position that are 
an ASCII case-insensitive match for the word "charset" and follow a 
delimiter character. If no such match is found, return nothing and abort 
these steps."

Define "delimiter character" suitably, such as "control characters, 
whitespace, and U+003B SEMICOLON character (;)".
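
To illustrate the difference, here is a minimal Python sketch of step 2 
only, before and after the proposed change. The function names and the 
exact delimiter set (U+0000 to U+0020, U+007F, and ";") are assumptions 
made for this example, not text from the spec; a match at the very start 
of the string is treated as "no preceding delimiter".

```python
import re

# ASCII case-insensitive match for the word "charset".
CHARSET = re.compile("charset", re.IGNORECASE | re.ASCII)


def is_delimiter(ch):
    # Assumed reading of "control characters, whitespace, and ';'".
    return ch == ";" or ord(ch) <= 0x20 or ord(ch) == 0x7F


def find_charset_original(s, position=0):
    """Step 2 as currently specified: first match anywhere, or -1."""
    m = CHARSET.search(s, position)
    return m.start() if m else -1


def find_charset_proposed(s, position=0):
    """Step 2 as proposed: first match that follows a delimiter, or -1."""
    for m in CHARSET.finditer(s, position):
        if m.start() > 0 and is_delimiter(s[m.start() - 1]):
            return m.start()
    return -1
```

With "text/plain; foocharset=UTF-8", the original search finds a match 
inside "foocharset", while the proposed search finds none.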

Alternatively, drop the whole algorithm, and just state that the value 
should be parsed according to the media-type grammar in [2]; see 
ISSUE-125 and ISSUE-126.
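
As a rough sketch of what grammar-based parsing could look like, the 
following Python walks the parameters of a field value using the "token" 
and "quoted-string" productions from RFC 2616. It is a simplification 
for illustration: the function name is hypothetical, and it does not 
attempt error recovery or the full set of whitespace rules.

```python
import re

# RFC 2616 "token": any CHAR except CTLs and separators.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# RFC 2616 "quoted-string", allowing backslash quoted-pairs.
QUOTED = r'"(?:[^"\\]|\\.)*"'

MEDIA_TYPE = re.compile(rf"\s*({TOKEN})/({TOKEN})")
PARAM = re.compile(rf"\s*;\s*({TOKEN})=({TOKEN}|{QUOTED})")


def charset_from_content_type(value):
    """Return the charset parameter value, or None if there is none."""
    m = MEDIA_TYPE.match(value)
    if m is None:
        return None
    pos = m.end()
    while True:
        p = PARAM.match(value, pos)
        if p is None:
            return None  # no more well-formed parameters
        name, raw = p.group(1), p.group(2)
        if name.lower() == "charset":
            if raw.startswith('"'):
                # Strip the quotes and undo quoted-pair escaping.
                raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
            return raw
        pos = p.end()
```

Because parameter names are compared as whole tokens, "foocharset" is 
never mistaken for "charset", and a "charset=" sequence inside a quoted 
value of another parameter is skipped as well.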

IMPACT

1. Positive Effects

Fewer incorrect matches when parsing the value.

2. Negative Effects

None.

3. Conformance Classes Changes

Not sure. Does this section describe conformance requirements?

4. Risks

None.

REFERENCES

[1] <http://dev.w3.org/html5/spec/Overview.html#content-type-sniffing>
[2] <http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.7>
[3] <http://www.w3.org/Bugs/Public/show_bug.cgi?id=9628#c3>

Received on Saturday, 29 January 2011 19:00:46 UTC