Re: ISSUE-125 CCP -- change the "willful violation" note -- rev 1

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 27 Jan 2011 08:28:35 +0100
To: Anne van Kesteren <annevk@opera.com>, Julian Reschke <julian.reschke@gmx.de>
Cc: "public-html@w3.org" <public-html@w3.org>
Message-ID: <20110127082835089020.d976ab6f@xn--mlform-iua.no>
Anne, HTML5's 'encoding sniffing algorithm' [1] uses the 'algorithm for 
extracting an encoding from a Content-Type' [2] twice: 

	1) before parsing: on the Content-Type metadata (HTTP). [1]
	2) during parsing: on the meta element pragma in the encoding
	   declaration state (http-equiv=content-type). [3]
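
To make the shared step concrete, here is a rough Python sketch of a lenient extraction in the spirit of [2]. It is a simplification, not the spec text: the name extract_charset is hypothetical, only the first occurrence of "charset" is examined, and returning nothing for an unmatched quote is an assumption of this sketch.

```python
import re

WHITESPACE = " \t\n\x0c\r"

def extract_charset(content_type):
    """Leniently pull a charset value out of a Content-Type string.

    Hypothetical helper: returns the value as a string, or None.
    """
    match = re.search(r"charset", content_type, re.IGNORECASE)
    if match is None:
        return None
    i = match.end()
    # Skip whitespace, require "=", skip whitespace again.
    while i < len(content_type) and content_type[i] in WHITESPACE:
        i += 1
    if i >= len(content_type) or content_type[i] != "=":
        return None
    i += 1
    while i < len(content_type) and content_type[i] in WHITESPACE:
        i += 1
    if i >= len(content_type):
        return None
    # Double and single quotes are treated alike here, which is the
    # leniency under discussion.
    if content_type[i] in "\"'":
        end = content_type.find(content_type[i], i + 1)
        if end == -1:
            return None  # assumption of this sketch for an unmatched quote
        return content_type[i + 1:end]
    # Unquoted value: read up to whitespace or a semicolon.
    j = i
    while j < len(content_type) and content_type[j] not in WHITESPACE + ";":
        j += 1
    return content_type[i:j] or None
```

The same function would serve both call sites listed above, whether the string comes from the HTTP header or from the content attribute of the meta element.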

Thus ISSUE-125 can't be isolated to HTTP-EQUIV unless the encoding 
sniffing algorithm [1] is changed. One could *make* the premise of your 
CP true by specifying two algorithms: one for HTTP-EQUIV and another 
for HTTP. Until then, combining a single algorithm with not violating 
HTTP requires

	EITHER a rewrite of the algorithm, as Julian suggested,
	OR solving the issue at the authoring-requirements level.

It seems to me that the OR option is what we currently have: 
Validator.nu screams if authors use HTTP-invalid syntax, despite what 
the algorithm accepts. 

Of course, the OR option is still a violation of HTTP ... Whom does it 
help to interpret invalid charset names as if they were valid? I fail 
to see how anyone who was aware of what he or she was doing would 
"willfully" put quotes around the charset name inside an 
HTTP-EQUIV="Content-Type" element.

In that regard, on Sun, 23 Jan 2011 15:36:35 +0100 you said:

> I did not say that. What I said is that it makes sense to change HTTP 
> because double and single quotes can be used all over the Web 
> Platform interchangeably. Often though more lenient syntax is more 
> compatible and authors do not always test in IE. There are places 

'Interchangeably' sounds nice. But is there any logic here? Where? 
With my limited knowledge of the HTTP spec and of the rules for which 
characters a charset name may contain, I do of course agree that it 
seems strange that encoding names can contain the single quote 
character. But then we need to fix _that_ problem. I don't see how we 
fix that problem by keeping this algorithm: even if we keep the 
algorithm you are fighting for, authors are still prohibited from using 
that syntax. So where is the interchangeability?
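
As an illustration of why single quotes are a different matter in HTTP, here is a small Python sketch that reads a parameter value the way the RFC 2616 grammar does (token or quoted-string). The helper names are hypothetical, and backslash escapes inside a quoted-string are ignored for brevity.

```python
# Separators per RFC 2616 section 2.2; note that the single quote
# is not among them, so it is an ordinary token character.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')

def is_token_char(ch):
    """True for a US-ASCII character that is neither a CTL nor a separator."""
    return 32 < ord(ch) < 127 and ch not in SEPARATORS

def http_charset_value(param_value):
    """Read a parameter value as an RFC 2616 token or quoted-string.

    Hypothetical helper: returns the value, or None if it is neither.
    Quoted-pair (backslash) escapes are not handled in this sketch.
    """
    if param_value.startswith('"'):
        end = param_value.find('"', 1)
        return param_value[1:end] if end != -1 else None
    if param_value and all(is_token_char(c) for c in param_value):
        return param_value
    return None
```

Under this reading, the value in charset='utf-8' is a token whose name includes the two quote characters, whereas the lenient HTML5 extraction would yield plain utf-8.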

Btw, the 'encoding sniffing algorithm' [1] permits UAs to use 
'information on the likely encoding for this page' etc., so such 
invalid encoding names could still be used at a later step in the 
encoding sniffing algorithm.

[1] http://www.w3.org/TR/html5/parsing#determining-the-character-encoding
[2] http://www.w3.org/TR/html5/fetching-resources#algorithm-for-extracting-an-encoding-from-a-content-type
[3] http://www.w3.org/TR/html5/tokenization#meta-charset-during-parse

Leif Halvard Silli

Anne van Kesteren, Mon, 24 Jan 2011 16:50:13 +0100:
> Summary: Change the note after "algorithm for extracting an encoding 
> from a Content-Type" to not mention HTTP as HTTP is not affected by 
> this algorithm.
> 
> Rationale: "algorithm for extracting an encoding from a Content-Type" 
> is only used to examine the contents of a document and therefore does 
> not affect HTTP. Claiming it a willful violation of HTTP is 
> misleading.
> 
> Details: Instead of saying this is a willful violation of HTTP say 
> this is a distinct algorithm from HTTP Content-Type processing for 
> usage outside of HTTP.
> 
> Impact: Hardly.
> 
> Anne van Kesteren
> http://annevankesteren.nl/
-- 
leif halvard silli
Received on Thursday, 27 January 2011 07:29:10 GMT