- From: Terje Bless <link@tss.no>
- Date: Sat, 28 Apr 2001 04:11:23 +0200
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- cc: www-validator@w3.org
On 28.04.01 at 03:42, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:
>Only HTML 4.0 and later make this restriction.
I very much would like to avoid special casing on HTML version.
>We have a major conflict between HTTP/1.1 and HTML 4.0 here;
Where was www-qa when we needed them... :-)
>[SNIP "META" kludge] I think this is just horrible and finding a correct
>_and_ usable solution is impossible.
Agreed.
>I think the best thing we can (and should) do is
>
> * report a warning if there is no charset parameter in the HTTP
> response
Someone should write good docs on charsets, the problems with them, and how
to select and specify a proper one, with HOWTO links for Apache and IIS.
Making this be a warningable -- :-) -- state is problematic insofar as
ciwah et al. ram "Latin1 is the default" down people's throats and people
get confused when it produces a warning (been there, done that). A link to
good docs might alleviate this, but this is not something I'm willing to
take any action on until I've checked with Gerald.
> * report a warning if there is (in addition) no charset parameter in
> "the" [1] <meta http-equiv='Content-Type' content='...'> content
> type declaration
> * report a warning if those two are given and don't match
This is Status Quo.
> * use ISO-8859-1 if none of them is given
Ditto, but this follows from the assumption on the semantics of the
HTTP/1.1 Content-Type field. If we change those we'll have to change this
code too.
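
For illustration only, the rules quoted so far (warn on a missing HTTP charset, warn again if META has none either, warn on a mismatch, fall back to ISO-8859-1) could be sketched roughly like this. It is Python rather than the validator's Perl, the names charset_param and pick_charset are made up for the example, and letting the HTTP header win over META is an assumption, not something settled in this thread.

```python
import re

def charset_param(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = re.search(r';\s*charset\s*=\s*"?([^\s;"]+)"?', content_type, re.IGNORECASE)
    return match.group(1).lower() if match else None

def pick_charset(http_content_type, meta_content):
    """Apply the quoted rules; returns (charset, list_of_warnings)."""
    warnings = []
    http_cs = charset_param(http_content_type)
    meta_cs = charset_param(meta_content)
    if http_cs is None:
        warnings.append("no charset parameter in the HTTP response")
        if meta_cs is None:
            warnings.append("no charset parameter in the META declaration either")
    elif meta_cs is not None and meta_cs != http_cs:
        warnings.append("HTTP charset %s and META charset %s do not match"
                        % (http_cs, meta_cs))
    # Assumption: the HTTP header takes precedence over META.
    # ISO-8859-1 is the last-resort default when neither source names a charset.
    return (http_cs or meta_cs or "iso-8859-1"), warnings
```

If the assumed semantics of the HTTP/1.1 Content-Type field change, the last line is the part that would have to change with them.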
> * report an error if the content doesn't match the declared encoding
>
> sub is_valid_us_ascii {[...]}
> sub is_valid_utf8 {[...]}
> sub is_valid_latin1 {[...]}
> sub is_valid_windows_1252 {[...]}
>
>I don't know how SP handles invalid input, maybe we can use it to
>perform some of these tasks.
While those regexes impressed the hell out of me -- :-) -- I don't like
this solution. It makes us an authoritative reference on charset issues
and commits us to maintaining provably correct implementations of these
checks. If I can get SP to do it (e.g. barf on "illegal" bytes in "this"
encoding), I'd much prefer that. The next alternative is to get Text::Iconv
or another CPAN module to do it (Map8?). The final fallback would be to
stuff your code into a module and nag you until you released it to CPAN. :-)
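
To make the "let a library do it" option concrete, here is a minimal sketch of that approach. It is in Python and uses the built-in codecs rather than SP or Text::Iconv, so it only illustrates the idea; the function name content_matches_encoding is hypothetical.

```python
def content_matches_encoding(data, charset):
    """Return True if the bytes decode cleanly under the declared charset."""
    try:
        # A strict decode raises on any byte sequence the codec considers illegal.
        data.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return False
    except LookupError:
        # The charset label itself is unknown to the codec registry.
        raise ValueError("unknown charset: %r" % charset)
    return True

print(content_matches_encoding(b"caf\xe9", "utf-8"))       # lone 0xE9 is not valid UTF-8
print(content_matches_encoding(b"caf\xe9", "iso-8859-1"))  # 0xE9 is fine in Latin-1
```

One caveat with this particular stand-in: Python's ISO-8859-1 codec accepts every byte value, so it can never report a Latin-1 error. Anything stricter for Latin-1 would still need a hand-written check.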
I'm going to experiment a bit with SP and see what it can do for us. With
any kind of luck it'll do the trick. The big problem is that we're
converting everything to UTF-8 internally, so by the time it gets to SP
it's too late. The exceptions are US-ASCII and ISO-Latin-1, which get
special treatment.
Received on Friday, 27 April 2001 23:51:20 UTC