Re: ISSUE-125 CCP -- change the "willful violation" note -- rev 1 from Leif Halvard Silli on 2011-01-27 (public-html@w3.org from January 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 27 Jan 2011 15:04:36 +0100
To: Anne van Kesteren <annevk@opera.com>
Cc: Julian Reschke <julian.reschke@gmx.de>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <20110127150436802482.f930bfca@xn--mlform-iua.no>

Anne van Kesteren, Thu, 27 Jan 2011 13:30:08 +0100:
> On Thu, 27 Jan 2011 13:25:03 +0100, Leif Halvard Silli  wrote:
>> If you are correct, then where does HTML5 specify how to handle the
>> HTTP Content-Type header?
> 
> HTTP and the Media Type Sniffing specification define that.

But where does HTML5 point to those?

W.r.t. MIMESNIFF, the section that we are discussing, section '2.7.3 
Determining the type of a resource', is the one which points to it. 
This section is also not only about 'text/html' but about any 
'resource'. Which, again, means that the Content-Type can only come 
from HTTP.

And, I repeat, if the UA is configured to 'strictly obey' the HTTP 
headers, as MIMESNIFF calls it, then there will be no sniffing.

We agree that the algorithm is used twice in the encoding sniffing 
algorithm. Then can you tell me when, according to you, the first of 
those times is? And why would it read the http-equiv twice? As I read 
it, the first time happens *before* the encoding sniffing algorithm 
starts running - the algorithm merely "listens" to what the result 
from Content-Type was: [2]

]] 2. If the transport layer specifies an encoding, and it is 
supported, return that encoding with the confidence certain, and abort 
these steps. [[

Or as that section also states: ]] This algorithm takes as input any 
out-of-band metadata available to the user agent (e.g. the Content-Type 
metadata of the document) and all the bytes available so far, and 
returns an encoding and a confidence. [[

Note also that the http-equiv pragma, per HTML5, is not 'Content-Type 
metadata' but an encoding declaration. [3] The encoding declaration 
section states that: [4]

]] If an HTML document does not start with a BOM, and if its encoding 
is not explicitly given by Content-Type metadata, and the document is 
not an iframe srcdoc document, then the character encoding used must be 
an ASCII-compatible character encoding, and, in addition, if that 
encoding isn't US-ASCII itself, then the encoding must be specified 
using a meta element with a charset attribute or a meta element with an 
http-equiv attribute in the Encoding declaration state. [[

Thus, the encoding declaration - in the form of meta@charset or 
meta@http-equiv=content-type - is only used when there isn't a BOM or 
when the Content-Type metadata (which, again, is described in [1]) 
does not provide confident encoding information.

Note that out-of-band metadata can also be info from the file system, 
according to MIMESNIFF.

It is clear to me that HTML5's encoding sniffing algorithm overlaps 
with things said in MIMESNIFF. Or would you say that those 512 bytes in 
step 3 of HTML5's encoding sniffing algorithm refer to another stream 
than the 512 bytes in MIMESNIFF? In that regard, MIMESNIFF states that

]] For efficiency reasons, implementations might wish to implement this
   algorithm and the algorithm for detecting the character encoding of
   HTML documents in parallel. [[

In summary: I can't see that you have proven that I have read the 
spec(s) wrong.

[1] http://www.w3.org/TR/html5/fetching-resources.html#content-type

[2] http://www.w3.org/TR/html5/parsing#encoding-sniffing-algorithm

[3] http://www.w3.org/TR/html5/semantics#attr-meta-http-equiv-content-type

[4] http://www.w3.org/TR/html5/semantics#character-encoding-declaration

-- 
leif halvard silli

Received on Thursday, 27 January 2011 14:05:11 UTC