
Re: HTML5 discussions regarding charset determination and sniffing

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 30 Sep 2010 16:19:27 +0200
Message-ID: <4CA49C6F.8040201@gmx.de>
To: Noah Mendelsohn <nrm@arcanedomain.com>
CC: Noah Mendelsohn <noah@arcanedomain.com>, "www-tag@w3.org" <www-tag@w3.org>
On 30.09.2010 16:01, Noah Mendelsohn wrote:
> Julian Reschke writes:
>
>  > The background is that HTML5 specifies an algorithm for extracting the
>  > charset from content type information, which (1) requires accepting
>  > invalid forms (single quotes), and (2) requires not to properly handle
>  > escapes in quoted strings.
>
> Thank you for the very helpful clarification. I agree that these
> "willful violations" are significant, and should be minimized to the
> extent practical. There is a big grey area between "sniffing" and
> silently recovering from syntactic or other errors in headers. This
> seems more toward the latter: allowing single quotes where double is
> required is a different sort of "being liberal" than looking at
> something labeled text/plain and determining "aha, you meant
> image/jpeg". Thanks!
>
> Noah

Note that allowing single quotes instead of double quotes may sound 
harmless, but:

<http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.14.17>:

     Content-Type   = "Content-Type" ":" media-type

<http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.7>:

     media-type     = type "/" subtype *( ";" parameter )

<http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.3.6>:

     parameter               = attribute "=" value
     attribute               = token
     value                   = token | quoted-string

and finally <http://greenbytes.de/tech/webdav/rfc2616.html#rfc.section.2.2>:

     token          = 1*<any CHAR except CTLs or separators>
     separators     = "(" | ")" | "<" | ">" | "@"
                    | "," | ";" | ":" | "\" | <">
                    | "/" | "[" | "]" | "?" | "="
                    | "{" | "}" | SP | HT

So the single quote is not a separator and is therefore allowed in tokens, and

     charset='foobar'

should be parsed as

     'foobar'

not

     foobar

(note that single quotes in parameter values using the token syntax are 
indeed in use).
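
To make the difference concrete, here is a minimal sketch in Python of a
parameter value parser that follows the RFC 2616 grammar quoted above. The
names is_token and parse_value are my own, chosen for illustration; this is
not the HTML5 algorithm and not any UA's actual code, and it only handles
the value part of a single parameter.

```python
import re

# Separators from RFC 2616 section 2.2. The single quote is not among them.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')

# quoted-string: double quotes around ordinary characters or quoted-pairs.
QUOTED_STRING = re.compile(r'"((?:[^"\\]|\\.)*)"', re.DOTALL)


def is_token(s: str) -> bool:
    """True if s matches token = 1*<any CHAR except CTLs or separators>."""
    return bool(s) and all(
        32 < ord(c) < 127 and c not in SEPARATORS for c in s
    )


def parse_value(raw: str) -> str:
    """Parse value = token | quoted-string and return the decoded value."""
    match = QUOTED_STRING.fullmatch(raw)
    if match:
        # Drop the surrounding double quotes and undo quoted-pair escapes.
        return re.sub(r'\\(.)', r'\1', match.group(1), flags=re.DOTALL)
    if is_token(raw):
        # A token is taken literally, single quotes included.
        return raw
    raise ValueError(f"not a token or quoted-string: {raw!r}")
```

With this sketch, parse_value("'foobar'") returns the token unchanged,
single quotes included, while parse_value('"foobar"') returns foobar.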

Requiring special treatment will either cause UAs to have separate 
parsers (not good), or potentially break legitimate uses of single 
quotes in other header fields (very bad).

I totally agree that UAs are very bad at header parsing; but adding more 
special cases doesn't seem to be an improvement.


Best regards, Julian
Received on Thursday, 30 September 2010 14:46:51 GMT
