- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Sat, 28 Apr 2001 03:42:16 +0200
- To: Terje Bless <link@tss.no>
- Cc: www-validator@w3.org
* Terje Bless wrote:
>>>>Btw. this is, as I'm sure you know, worse for HTML documents. XML
>>>>documents can be encoded in UTF-8 or UTF-16 without declaring it,
>>>>HTML can't, you must always declare the used encoding, since the user
>>>>agent must not assume any default character encoding.
>>>
>>>IIRC, we still have that ISO-8859-1 default from the HTTP/1.1 spec, non?
>>
>>See HTML 4.01 section 5.2.2, 'Therefore, user agents must not assume any
>>default value for the "charset" parameter'.
>
>How practical is it to put this into production? If the validator makes no
>assumptions, will it make people fix their servers? Should this be
>retroactively applied to earlier HTML versions? What says the W3C HTML
>Recommendation overrules the IETF's HTTP Standard?

Only HTML 4.0 and later make this restriction. We have a major conflict
between HTTP/1.1 and HTML 4.0 here: HTTP/1.1 not only defines ISO-8859-1
as the default encoding, it also states in section 19.3 that "not
labeling the entity is preferred over labeling the entity with the
labels US-ASCII or ISO-8859-1". RFC 2854, on the other hand, strongly
recommends the use of an explicit charset parameter.

Even worse, HTML 4 enables authors to use a meta element to set/override
HTTP headers. I'm not sure whether a meta element overrides the HTTP
header that was actually sent; HTML 4 only says in section 7.4.4 for the
http-equiv attribute: "HTTP servers use this attribute to gather
information for HTTP response message headers". I don't think any server
developer ever took this seriously (I wouldn't either)...

I think this is just horrible and finding a correct _and_ usable
solution is impossible.
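To make the two declaration sites discussed above concrete, here is a
minimal sketch in Python (not the validator's own code) that pulls the
charset parameter out of an HTTP Content-Type value and out of every
<meta http-equiv='Content-Type'> element in a document. The function and
class names are hypothetical, and decoding the body as Latin-1 for the
scan is an assumption that only works for ASCII-compatible encodings.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    charset = msg.get_param("charset")  # None when the parameter is absent
    return charset.strip().lower() if isinstance(charset, str) else None


class MetaContentTypes(HTMLParser):
    """Collect the content attribute of every meta http-equiv=Content-Type."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            self.values.append(attributes.get("content") or "")


def meta_charsets(body):
    """Return one entry (charset or None) per matching meta element."""
    parser = MetaContentTypes()
    # Assumption: markup is ASCII-compatible, so a Latin-1 decode is lossless
    # enough to find the meta elements before the real encoding is known.
    parser.feed(body.decode("latin-1"))
    parser.close()
    return [charset_from_content_type(value) for value in parser.values]
```

Because the result is a list, a document with more than one such meta
element is visible to the caller, which can then decide how to treat the
ambiguity.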
I think the best thing we can (and should) do is

  * report a warning if there is no charset parameter in the HTTP
    response
  * report a warning if there is (in addition) no charset parameter in
    "the" [1] <meta http-equiv='Content-Type' content='...'> content
    type declaration
  * use ISO-8859-1 if neither of them is given
  * report a warning if both are given and don't match
  * report an error if the content doesn't match the declared encoding

I can contribute code for the last item:

  sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

  # Structural check only: each lead byte must be followed by the right
  # number of continuation bytes. Overlong forms are not rejected.
  sub is_valid_utf8 {
    shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1}
               | [\xE0-\xEF][\x80-\xBF]{2}
               | [\xF0-\xF7][\x80-\xBF]{3}
               | [\xF8-\xFB][\x80-\xBF]{4}
               | [\xFC-\xFD][\x80-\xBF]{5}
               | [\x00-\x7f])*$/x;
  }

  # Rejects the C1 control range 0x80-0x9F.
  sub is_valid_latin1 { shift =~ /^[\x00-\x7f\xA0-\xFF]*$/ }

  # Accepts any byte sequence.
  sub is_valid_windows_1252 { 1 }

I don't know how SP handles invalid input; maybe we can use it to
perform some of these tasks.

[1] HTML 4.01 doesn't say what to do if there is more than one element
    with the same http-equiv value
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/
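As a rough illustration of the five checks proposed in the message
above, here is a minimal sketch in Python rather than the Perl used in
the message. All names are hypothetical. Two points are assumptions, not
something the message settles: the HTTP charset is given precedence over
the meta charset when both are present, and charsets without a checker
are simply not validated. The UTF-8 check uses Python's strict decoder,
which is stricter than the regular expression in the message (it also
rejects overlong forms and sequences longer than four bytes).

```python
def _is_valid_utf8(body):
    try:
        body.decode("utf-8")  # strict decoding; swapped in for the regex
        return True
    except UnicodeDecodeError:
        return False


# One checker per charset, mirroring the four Perl subs in the message.
CHECKERS = {
    "us-ascii": lambda body: all(byte < 0x80 for byte in body),
    "utf-8": _is_valid_utf8,
    "iso-8859-1": lambda body: not any(0x80 <= byte <= 0x9F for byte in body),
    "windows-1252": lambda body: True,
}


def check_encoding(body, http_charset, meta_charset):
    """Return (charset used, list of (level, text)) for a fetched document."""
    messages = []
    if http_charset is None:
        messages.append(("warning", "no charset parameter in the HTTP response"))
        if meta_charset is None:
            messages.append(
                ("warning", "no charset parameter in the meta declaration either"))
    if http_charset and meta_charset and http_charset.lower() != meta_charset.lower():
        messages.append(("warning", "HTTP and meta charset do not match"))

    # Assumption: HTTP wins over meta; ISO-8859-1 is the last resort.
    declared = (http_charset or meta_charset or "iso-8859-1").lower()

    checker = CHECKERS.get(declared)  # unknown charsets are not validated
    if checker is not None and not checker(body):
        messages.append(("error", "content is not valid " + declared))
    return declared, messages
```

For example, check_encoding(b"caf\xe9", "utf-8", "iso-8859-1") reports
the mismatch warning and then an error, because the lone 0xE9 byte is
not a complete UTF-8 sequence.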
Received on Friday, 27 April 2001 21:41:23 UTC