Re: Validator case-sensitive bug for CHARSET? from olivier Thereaux on 2007-08-07 (www-validator@w3.org from August 2007)

From: olivier Thereaux <ot@w3.org>
Date: Tue, 7 Aug 2007 20:38:48 +0900
To: invalid@csc.jp, www-validator Community <www-validator@w3.org>
Message-Id: <64584787-9995-43F3-A34E-D010ED9811D6@w3.org>

On Aug 7, 2007, at 17:48 , invalid@csc.jp wrote:
> <blockquote cite="http://www.ietf.org/rfc/rfc2616.txt">
> 3.7 Media Types
>
>    HTTP uses Internet Media Types [17] in the Content-Type (section
>    14.17) and Accept (section 14.1) header fields in order to provide
>    open and extensible data typing and type negotiation.
>
>        media-type     = type "/" subtype *( ";" parameter )
>        type           = token
>        subtype        = token
>
>    Parameters MAY follow the type/subtype in the form of attribute/value
>    pairs (as defined in section 3.6).
>
>    The type, subtype, and parameter attribute names are case-
>    insensitive. Parameter values might or might not be case-sensitive,
>    depending on the semantics of the parameter name. (...)
> </blockquote>

Thanks, that's the info I was looking for.

So as far as HTTP (and thus Http-Equiv meta in HTML) is concerned
>   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
is equivalent to
>   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=ISO-8859-1">
and to
>   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charSet=ISO-8859-1">
(etc.) and Ernest's test cases are valid.
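
As a quick cross-check of that reading, here is a small sketch in Python (not the validator's Perl code, just a convenient parser to test against). It assumes that the standard library's email.message.Message.get_param() matches parameter names without regard to case; the helper name charset_of is made up for this example.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param() compares parameter names case-insensitively;
    # the parameter value is returned with its original case.
    return msg.get_param("charset")


variants = [
    "text/html; charset=ISO-8859-1",
    "text/html; CHARSET=ISO-8859-1",
    "text/html; charSet=ISO-8859-1",
]
results = [charset_of(v) for v in variants]
```

If the assumption holds, all three entries of results are the same string, which is the behaviour the RFC text seems to ask for.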

I looked at the validator code, and for that part of the content
detection, we use Bjoern's module, HTML::Encoding.
-> http://search.cpan.org/src/BJOERN/HTML-Encoding-0.53/lib/HTML/Encoding.pm
-> sub encoding_from_meta_element()
-> sub encoding_from_content_type()
encoding_from_content_type() relies on the tokenization of the HTTP
header done by sub split_header_words() in HTTP::Headers::Util (itself
part of libwww-perl).

I'm not convinced the "bug" is in HTML::Encoding. HTML::Encoding
looks for the "charset" key of the tokenized HTTP header, and it's
not really reasonable to expect it to look for CHARSET, charSet,
and every other case variant as well.

I guess, from the bit of the spec quoted above, the tokenization
should probably convert the media type parameter names (not their
values, which may be case-sensitive) to lower case. Hence, when finding
Content-Type: foo/bar; ParaMeter=value
   @values = split_header_words($h->header("Content-Type"));
should return
['foo/bar' => undef, parameter => 'value']
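
To make the proposal concrete, here is a rough sketch of that kind of tokenization, written in Python for illustration only. It is not the split_header_words() code from libwww-perl; the function name split_media_type is hypothetical, and the parsing is deliberately naive (it would mishandle a quoted value that contains ";").

```python
def split_media_type(header_value):
    """Split a Content-Type value into (media_type, params).

    Naive sketch: splits on ';' and '=' without full quoted-string handling.
    """
    parts = [p.strip() for p in header_value.split(";")]
    # Type and subtype are case-insensitive, so fold them too.
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing or doubled ';'
        name, sep, value = part.partition("=")
        # Parameter names are case-insensitive: fold them to lower case.
        # Values keep their case, since they may be case-sensitive.
        params[name.strip().lower()] = value.strip().strip('"') if sep else None
    return media_type, params


media_type, params = split_media_type("foo/bar; ParaMeter=value")
```

With the names folded at this stage, a caller such as HTML::Encoding only ever needs to look up the lower-case "charset" key, whichever spelling the document author used.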


(Bjoern and Gisle are in Bcc on this mail, and I will forward it as a
CPAN bug report for LWP.)

-- 
olivier

Received on Tuesday, 7 August 2007 11:38:08 UTC