Re: ISSUE-126: charset-vs-backslashes - Straw Poll for Objections from Philip Jägenstedt on 2011-03-05 (public-html@w3.org from March 2011)

From: Philip Jägenstedt <philipj@opera.com>
Date: Sat, 05 Mar 2011 22:41:23 +0100
To: public-html@w3.org
Message-ID: <op.vrv3u9fnsr6mfa@nog>
On Sat, 05 Mar 2011 21:27:15 +0100, Julian Reschke <julian.reschke@gmx.de>  
wrote:

> Here are a few comments on Philip's feedback in  
> <http://www.w3.org/2002/09/wbs/40318/issue-126-objection-poll/results>:
>
>> The proposal aims to align processing with the HTTP spec in order to  
>> remove a willful violation, but does not achieve that, even assuming  
>> that the sibling proposal for ISSUE-125 is adopted.
>>
>> The "algorithm for extracting an encoding from a Content-Type" should  
>> be applied to the value of the content="" attribute on <meta  
>> http-equiv="Content-Type">. In order to claim conformance with HTTP,  
>> that value should be processed like the media-type production in RFC  
>> 2616:
>>
>> media-type = type "/" subtype *( ";" parameter )
>> type = token
>> subtype = token
>>
>> parameter = attribute "=" value
>> attribute = token
>> value = token | quoted-string
>>
>> quoted-string = ( <"> *(qdtext | quoted-pair ) <"> )
>> qdtext = <any TEXT except <">>
>> quoted-pair = "\" CHAR
>>
>> The critical part of the suggested change is "Return the encoding  
>> corresponding to the backslash-unescaped string between this characters  
>> and the next earliest occurrence of this character." This is more  
>> liberal than the quoted-string production, allowing e.g.  
>> content='text/html;charset="UTF-8"garbage'.
>
> Indeed; see my mail from  
> <http://lists.w3.org/Archives/Public/public-html/2011Jan/0358.html>. It  
> may have been a bad decision to put this into three ISSUEs; this is  
> mainly a result of Ian refusing to look at Bugzilla entries that  
> describe multiple, related problems.
>
> Writing CPs for each of these while the others are in progress makes  
> things hard.
>
>> Furthermore, earlier steps of the algorithm are nowhere near close to  
>> the HTTP spec, simply finding the first occurrence of "charset",  
>> allowing e.g. content='garbagecharset=UTF-8'.
>
> I believe this is ISSUE-148.
>
>> Only if the algorithm as a whole matches exactly the media-type  
>> production will the spec not require "recipients to parse Content-Type  
>> headers in <meta> elements in a way breaking HTTP's parsing rules."  
>> Since the change proposal does not achieve that, I object to its  
>> adoption.
>
> Again, it's a process problem that we're looking at three issues at the  
> same time.

OK, I wasn't aware that there was a third issue as well. Would it be fair  
to simply treat the sum of your proposals as a single proposal that causes  
the content="" attribute value to be parsed as per the media-type  
production?

> The bug was originally raised because the spec claims that the described  
> behavior was needed for compatibility with "existing content". This has  
> been proven to be nonsense, or minimally an exaggeration.

It seems to me that parsing as per the media-type production is  
actually extremely likely to break existing content. The impact of  
backslash escaping or quotes is likely rather small (not zero), but the  
way the charset parameter is extracted (ISSUE-148) is much more serious.  
The following kinds of typos are very likely to exist in the wild in  
fairly large numbers, and would break:

content='text/html charset=UTF-8' (missing semicolon)
content='text/html: charset=UTF-8' (colon instead of semicolon)
content='text/html; charset = UTF-8' (whitespace between attribute and  
value)
content='text/html; charset=UTF-8;' (trailing semicolon)
content='text/html;; charset=UTF-8' (double semicolon)
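
To make the difference concrete, here is a minimal Python sketch. It assumes  
a literal reading of the RFC 2616 media-type production (optional whitespace  
only around ";", none around "/" or "="), and the lenient extractor is only a  
rough approximation of "find the first charset", not the actual spec  
algorithm. The names strict_charset and lenient_charset are made up for  
illustration.

```python
import re

# RFC 2616 token: any CHAR except CTLs and separators.
TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
# quoted-string: qdtext or quoted-pair between double quotes.
QUOTED = r'"(?:[^"\\]|\\[\x00-\x7f])*"'

PARAM = re.compile(rf"({TOKEN})=({TOKEN}|{QUOTED})")
MEDIA_TYPE = re.compile(
    rf"{TOKEN}/{TOKEN}"
    rf"((?:[ \t]*;[ \t]*{TOKEN}=(?:{TOKEN}|{QUOTED}))*)[ \t]*"
)

# Assumption: a simplified stand-in for "find the first charset" behaviour.
LENIENT = re.compile(
    r"""charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;"']+))""",
    re.IGNORECASE,
)


def strict_charset(content):
    """Return the charset only if the whole value matches media-type."""
    m = MEDIA_TYPE.fullmatch(content)
    if m is None:
        return None
    for name, value in PARAM.findall(m.group(1)):
        if name.lower() == "charset":
            if value.startswith('"'):
                # Strip the quotes and undo backslash escaping.
                value = re.sub(r"\\(.)", r"\1", value[1:-1], flags=re.DOTALL)
            return value
    return None


def lenient_charset(content):
    """Return whatever follows the first 'charset =' in the value."""
    m = LENIENT.search(content)
    if m is None:
        return None
    return next(g for g in m.groups() if g is not None)


TYPOS = [
    "text/html charset=UTF-8",
    "text/html: charset=UTF-8",
    "text/html; charset = UTF-8",
    "text/html; charset=UTF-8;",
    "text/html;; charset=UTF-8",
]

for value in TYPOS:
    print(repr(value), strict_charset(value), lenient_charset(value))
```

Under these assumptions the strict function rejects all five values above  
(it returns None), while the lenient one still finds UTF-8. Both agree on  
the well-formed 'text/html; charset=UTF-8'.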

> If we follow Anne's proposal for ISSUE-125 we'll at least have spec text  
> that simply states that parsing of meta tag values is different from  
> HTTP header field values, which is an improvement. We can then focus on  
> deciding *which* of all of these differences make sense/are "required".

One could instrument an existing HTML5 parser to strictly use the  
media-type production, then run that and a standard one over a few million  
web pages. My guess is that we'd find that the detected encoding is  
different on a non-negligible percentage of pages.
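
A rough sketch of such a comparison, in Python. Several things here are  
assumptions: html.parser from the standard library stands in for a real  
HTML5 parser (it is not one), the two extractors are simplified stand-ins  
for the strict and the current behaviour, and all names (compare,  
MetaContentTypeCollector and so on) are hypothetical.

```python
import re
from html.parser import HTMLParser

TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
QUOTED = r'"(?:[^"\\]|\\[\x00-\x7f])*"'
PARAM = re.compile(rf"({TOKEN})=({TOKEN}|{QUOTED})")
STRICT = re.compile(
    rf"{TOKEN}/{TOKEN}(?:[ \t]*;[ \t]*{TOKEN}=(?:{TOKEN}|{QUOTED}))*[ \t]*"
)
LENIENT = re.compile(
    r"""charset\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s;"']+))""",
    re.IGNORECASE,
)


def strict_charset(value):
    """Charset per the media-type production, or None if it does not match."""
    if STRICT.fullmatch(value) is None:
        return None
    for name, val in PARAM.findall(value):
        if name.lower() == "charset":
            if val.startswith('"'):
                val = re.sub(r"\\(.)", r"\1", val[1:-1], flags=re.DOTALL)
            return val
    return None


def lenient_charset(value):
    """Simplified stand-in for the 'first charset' extraction."""
    m = LENIENT.search(value)
    if m is None:
        return None
    return next(g for g in m.groups() if g is not None)


class MetaContentTypeCollector(HTMLParser):
    """Collect content="" values of <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            if attr.get("content"):
                self.values.append(attr["content"])


def _norm(charset):
    return charset.lower() if charset else None


def compare(pages, extract_a, extract_b):
    """Return (differing, total) over pages that have such a <meta>."""
    differing = total = 0
    for html in pages:
        collector = MetaContentTypeCollector()
        collector.feed(html)
        collector.close()
        if not collector.values:
            continue
        total += 1
        value = collector.values[0]
        if _norm(extract_a(value)) != _norm(extract_b(value)):
            differing += 1
    return differing, total
```

Calling compare(pages, strict_charset, lenient_charset) on a list of page  
sources gives the number of pages where the two disagree and the number of  
pages that had a Content-Type <meta> at all. A real measurement would need  
an actual HTML5 parser and would also have to account for the HTTP header  
and BOM taking precedence over <meta>.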

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Saturday, 5 March 2011 21:42:00 UTC