Re: ISSUE-125 CCP -- change the "willful violation" note -- rev 1 from Anne van Kesteren on 2011-01-27 (public-html@w3.org from January 2011)

From: Anne van Kesteren <annevk@opera.com>
Date: Thu, 27 Jan 2011 17:29:14 +0100
To: "Leif Halvard Silli" <xn--mlform-iua@xn--mlform-iua.no>
Cc: "Julian Reschke" <julian.reschke@gmx.de>, "public-html@w3.org" <public-html@w3.org>
Message-ID: <op.vpy6q0x064w2qv@anne-van-kesterens-macbook-pro.local>

On Thu, 27 Jan 2011 15:04:36 +0100, Leif Halvard Silli  
<xn--mlform-iua@målform.no> wrote:
> Anne van Kesteren, Thu, 27 Jan 2011 13:30:08 +0100:
>> HTTP and the Media Type Sniffing specification define that.
>
> But where does HTML5 point to those?
>
> W.r.t. MIMESNIFF, then the section that we discuss, section '2.7.3
> Determining the type of a resource', is the one which points to it.
> This section is also not only about 'text/html' but about any
> 'resource'. Which, again, means that the Content-Type can only come
> from HTTP.

Yes, if it comes from HTTP and has an encoding declared there, the
algorithm under discussion will not be used.
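
A minimal sketch of that precedence, in Python. The function names, the
regular expression and the 512-byte window are illustrative assumptions,
not the specification's algorithm: a charset declared on the transport
layer is returned as "certain", and the meta scan is only reached when
no such charset exists.

```python
import codecs
import re
from email.message import Message

# Greatly simplified stand-in for scanning <meta> declarations (assumption).
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)


def transport_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def is_supported(label):
    """True if Python knows a codec for this label."""
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False


def sniff_encoding(content_type, head):
    """Return (encoding, confidence) for a resource (hypothetical helper)."""
    charset = transport_charset(content_type)
    if charset and is_supported(charset):
        # Transport layer wins; the meta scan below is never run.
        return charset, "certain"
    match = META_CHARSET.search(head[:512])
    if match:
        label = match.group(1).decode("ascii").lower()
        if is_supported(label):
            return label, "tentative"
    # Nothing declared; a real user agent would pick a default here.
    return None, "tentative"
```

With both an HTTP charset and a <meta charset> present, the HTTP value is
the one returned; the meta declaration is simply never consulted.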


> And, I repeat, if the UA is configured to 'strictly obey' the HTTP
> headers, as MIMESNIFF calls it, then there will be no sniffing.
>
> We agree that the algorithm is used twice in the encoding sniffing
> algorithm. Then can you tell me when, according to you, the first of
> those times is?

The first time is during the pre-parser-scan of the resource and the  
second time is while parsing in case the encoding is still not definitive.


> And why would it read the http-equiv twice? According
> to myself, the first time happens *before* the encoding sniffing
> algorithm starts running - the algorithm merely "listens" to what the
> result from Content-Type was: [2]
>
> ]] 2. If the transport layer specifies an encoding, and it is
> supported, return that encoding with the confidence certain, and abort
> these steps. [[
>
> Or as that section also states: ]] This algorithm takes as input any
> out-of-band metadata available to the user agent (e.g. the Content-Type
> metadata of the document) and all the bytes available so far, and
> returns an encoding and a confidence. [[

That is a different algorithm from the one under discussion.


> Note also that the http-equiv pragma, per HTML5 is not 'content-type
> metadata' but an encoding declaration. [3] The encoding declaration
> section states that: [4]
>
> ]] If an HTML document does not start with a BOM, and if its encoding
> is not explicitly given by Content-Type metadata, and the document is
> not an iframe srcdoc document, then the character encoding used must be
> an ASCII-compatible character encoding, and, in addition, if that
> encoding isn't US-ASCII itself, then the encoding must be specified
> using a meta element with a charset attribute or a meta element with an
> http-equiv attribute in the Encoding declaration state. [[
>
> Thus, the encoding declaration - in form of meta@charset or
> meta@http-equiv=content-type - is only used when there isn't a BOM or
> when the Content-Type meta data (which, again, is described in [1]),
> does not provide confident encoding information.

You are drawing the wrong conclusion. It is perfectly fine to have both
HTTP Content-Type and a <meta charset>. What you quoted places limitations
on the encoding if there is no Content-Type metadata; it does not say
anything else.


> Note that out-of-band can also be info from the file system - says
> MIMESNIFF.
>
> It is clear, to me, that HTML5's encoding sniffing algorithm overlaps
> with things said in MIMESNIFF. Or would you say that those 512 bytes in
> step 3 of HTML5's encoding sniffing algorithm refer to another stream
> than the 512 bytes in MIMESNIFF? In that regard, MIMESNIFF states that
>
> ]] For efficiency reasons, implementations might wish to implement this
>    algorithm and the algorithm for detecting the character encoding of
>    HTML documents in parallel. [[

What Media Type Sniffing does with those first 512 bytes is not extracting  
the encoding but determining the type of the resource. Determining the  
correct Content-Type header for the resource happens elsewhere. You are  
confusing algorithms.
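
To make the distinction concrete, here is a rough sketch in Python of
what type sniffing looks at. The signature table is a small illustrative
subset and the function name is hypothetical, not the Media Type Sniffing
algorithm itself. Note that it reads the same leading bytes but returns
only a media type, never an encoding.

```python
# Illustrative subset of byte signatures (assumption, not the full table).
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]
HTML_PREFIXES = (b"<!doctype html", b"<html")


def sniff_type(head):
    """Guess a media type from the first 512 bytes, or return None."""
    head = head[:512]
    for magic, media_type in SIGNATURES:
        if head.startswith(magic):
            return media_type
    # Skip leading whitespace, then compare case-insensitively.
    stripped = head.lstrip(b" \t\r\n\x0c").lower()
    if stripped.startswith(HTML_PREFIXES):
        return "text/html"
    return None
```

A <meta charset> inside those bytes has no effect on the result; only
the leading signature is inspected.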


> In summary: I can't see that you have proven that I have read the
> spec(s) wrong.

I give up.


> [1] http://www.w3.org/TR/html5/fetching-resources.html#content-type
> [2] http://www.w3.org/TR/html5/parsing#encoding-sniffing-algorithm
> [3]
> http://www.w3.org/TR/html5/semantics#attr-meta-http-equiv-content-type
> [4] http://www.w3.org/TR/html5/semantics#character-encoding-declaration


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Thursday, 27 January 2011 16:29:49 UTC