Re: 8-bit chars in US-ASCII documents (was Re: Embarrassing typo!) from Terje Bless on 2001-04-28 (www-validator@w3.org from April 2001)

From: Terje Bless <link@tss.no>
Date: Sat, 28 Apr 2001 04:11:23 +0200
To: Bjoern Hoehrmann <derhoermi@gmx.net>
cc: www-validator@w3.org
Message-ID: <20010428055113-b01010701-200e5de3@192.146.238.90>

On 28.04.01 at 03:42, Bjoern Hoehrmann <derhoermi@gmx.net> wrote:

>Only HTML 4.0 and later make this restriction.

I would very much like to avoid special-casing on HTML version.

>We have a major conflict between HTTP/1.1 and HTML 4.0 here;

Where was www-qa when we needed them... :-)

>[SNIP "META" kludge] I think this is just horrible and finding a correct
>_and_ usable solution is impossible.

Agreed.

>I think the best thing we can (and should) do is
>
>  * report a warning if there is no charset parameter in the HTTP
>    response

Someone should write good docs on charsets, problems with them, and help in
selecting and specifying a proper one. HOWTO links for Apache and IIS.

Making this be a warningable -- :-) -- state is problematic insofar as
ciwah et al ram "Latin1 is the default" down people's throats and people
get confused when it produces a warning (been there, done that). A link to
good docs might alleviate this, but this is not something I'm willing to
take any action on until I've checked with Gerald.

>  * report a warning if there is (in addition) no charset parameter in
>    "the" [1] <meta http-equiv='Content-Type' content='...'> content
>    type declaration
>  * report a warning if those two are given and don't match

This is Status Quo.

>  * use ISO-8859-1 if none of them is given

Ditto, but this follows from our assumption about the semantics of the
HTTP/1.1 Content-Type field. If we change that assumption we'll have to
change this code too.
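
For illustration, the warning and fallback rules quoted so far can be
sketched roughly as follows. This is Python rather than the validator's
Perl, and the function name, the warning texts, and the choice to let the
HTTP header win over META are assumptions made for the example, not the
validator's actual code.

```python
from typing import List, Optional, Tuple

# Fallback named in the thread for the case where no charset is declared.
DEFAULT_CHARSET = "iso-8859-1"


def resolve_charset(http_charset: Optional[str],
                    meta_charset: Optional[str]) -> Tuple[str, List[str]]:
    """Return (charset to use, warnings) for one document.

    Hypothetical helper: both arguments are the raw charset parameter
    values, or None when the parameter is absent.
    """
    warnings: List[str] = []
    http = http_charset.strip().lower() if http_charset else None
    meta = meta_charset.strip().lower() if meta_charset else None

    if http is None:
        warnings.append("no charset parameter in the HTTP response")
        if meta is None:
            # "in addition": only reported when HTTP is silent as well.
            warnings.append("no charset parameter in the META declaration")
    elif meta is not None and http != meta:
        warnings.append("HTTP and META charsets differ: %s vs %s" % (http, meta))

    # Assumption: HTTP takes precedence over META when both are present.
    charset = http or meta or DEFAULT_CHARSET
    return charset, warnings
```

Calling resolve_charset(None, None) falls back to ISO-8859-1 and returns
both "missing" warnings.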

>  * report an error if the content doesn't match the declared encoding
>
>    sub is_valid_us_ascii     {[...]}
>    sub is_valid_utf8         {[...]}
>    sub is_valid_latin1       {[...]}
>    sub is_valid_windows_1252 {[...]}
>
>I don't know how SP handles invalid input, maybe we can use it to
>perform some of these tasks.

While those regexes impressed the hell out of me -- :-) -- I don't like
this solution. It would make us an authoritative reference on charset
issues, responsible for maintaining provably correct implementations of
these checks. If I can get SP to do it (e.g. barf on "illegal" bytes in
"this" encoding), I'd much prefer that. The next alternative is to get
Text::Iconv or another CPAN module to do it (Map8?). The final fallback
would be to stuff your code into a module and nag you until you released
it to CPAN. :-)
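
As a rough illustration of delegating the check to an existing decoder
instead of hand-written regexes, here is a minimal sketch. It uses
Python's built-in codecs as a stand-in for SP or a CPAN module, and the
function name is hypothetical.

```python
from typing import Optional


def first_invalid_byte(data: bytes, charset: str) -> Optional[int]:
    """Return the offset of the first byte illegal in `charset`, or None.

    The check is delegated to the standard library's strict decoder.
    An unknown charset name raises LookupError, which is left to the
    caller to handle.
    """
    try:
        data.decode(charset, errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None
```

One caveat: Python's iso-8859-1 codec accepts every byte value, so a
stricter Latin-1 check (for example one that rejects the 0x80-0x9F
control range) would still need extra logic on top of this.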

I'm going to experiment a bit with SP and see what it can do for us. With
any kind of luck it'll do the trick. The big problem is that we're
converting everything to UTF-8 internally, so by the time it gets to SP
it's too late. The exceptions are US-ASCII and ISO-Latin-1, which get special
treatment.

Received on Friday, 27 April 2001 23:51:20 UTC