Re: 8-bit chars in US-ASCII documents (was Re: Embarrassing typo!) from Bjoern Hoehrmann on 2001-04-28 (www-validator@w3.org from April 2001)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 28 Apr 2001 03:42:16 +0200
To: Terje Bless <link@tss.no>
Cc: www-validator@w3.org
Message-ID: <dv4ket89l1g6erqvlqp5iu909soqv4glte@4ax.com>

* Terje Bless wrote:
>>>>Btw. this is, as I'm sure you know, worse for HTML documents. XML
>>>>documents can be encoded in UTF-8 or UTF-16 without declaring it,
>>>>HTML can't, you must always declare the used encoding, since the user
>>>>agent must not assume any default character encoding.
>>>
>>>IIRC, we still have that ISO-8859-1 default from the HTTP/1.1 spec, non?
>>
>>See HTML 4.01 section 5.2.2, 'Therefore, user agents must not assume any
>>default value for the "charset" parameter'.
>
>How practical is it to put this into production? If the validator makes no
>assumptions, will it make people fix their servers? Should this be
>retroactively applied to earlier HTML versions? What says the W3C HTML
>Recommendation overrules the IETF's HTTP Standard?

Only HTML 4.0 and later impose this restriction. We have a major conflict
between HTTP/1.1 and HTML 4.0 here: HTTP/1.1 not only defines ISO-8859-1
as the default encoding, it even states in section 19.3 that "not
labeling the entity is preferred over labeling the entity with the
labels US-ASCII or ISO-8859-1". RFC 2854, on the other hand, strongly
recommends the use of an explicit charset parameter. Even worse, HTML 4
lets authors use a meta element to set or override HTTP headers. I'm
not sure whether a meta element overrides the HTTP header actually sent;
HTML 4 only says in section 7.4.4 about the http-equiv attribute: "HTTP
servers use this attribute to gather information for HTTP response
message headers". I don't think any server developer ever took this
seriously (I wouldn't either)... I think this is just horrible, and
finding a solution that is both correct _and_ usable is impossible.

I think the best thing we can (and should) do is

  * report a warning if there is no charset parameter in the HTTP
    response
  * report a warning if there is (in addition) no charset parameter in
    "the" [1] <meta http-equiv='Content-Type' content='...'> content
    type declaration
  * use ISO-8859-1 if none of them is given
  * report a warning if those two are given and don't match
  * report an error if the content doesn't match the declared encoding
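
A rough sketch of how the first four items could fit together. The
helper names charset_of and pick_charset are invented for this example
and are not validator code. It assumes the HTTP Content-Type value and
the meta content value are already available as strings, and it simply
assumes the HTTP header wins when both declare a charset.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type
# value such as "text/html; charset=ISO-8859-1"; undef if there is none.
sub charset_of
{
    my ($content_type) = @_;
    return undef unless defined $content_type
        and $content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)/i;
    return lc $1;
}

# Hypothetical helper: $http is the HTTP Content-Type header value, $meta
# the content attribute of the meta element (either may be undef).
# Returns the charset to check against, followed by any warnings.
sub pick_charset
{
    my ($http, $meta) = @_;
    my $from_http = charset_of($http);
    my $from_meta = charset_of($meta);
    my @warnings;

    push @warnings, 'no charset parameter in the HTTP response'
        unless defined $from_http;
    push @warnings, 'no charset parameter in the meta element either'
        unless defined $from_http or defined $from_meta;
    push @warnings, "HTTP says $from_http, meta says $from_meta"
        if defined $from_http and defined $from_meta
           and $from_http ne $from_meta;

    # Assumption: HTTP beats meta; ISO-8859-1 is the last resort.
    my $charset = defined $from_http ? $from_http
                : defined $from_meta ? $from_meta
                :                      'iso-8859-1';
    return ($charset, @warnings);
}
```

Called as my ($charset, @warnings) = pick_charset($header, $meta), the
result can then be handed to whichever content check fits $charset.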

I can contribute code for the last item:

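    # Each check expects the document as a string of raw bytes.

    # US-ASCII: only 7-bit values are allowed.
    sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

    # UTF-8: checks the lead/trail byte structure only (sequences of up
    # to six bytes); overlong forms and surrogates are not rejected.
    sub is_valid_utf8
    {
        shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                      [\xE0-\xEF][\x80-\xBF]{2} |
                      [\xF0-\xF7][\x80-\xBF]{3} |
                      [\xF8-\xFB][\x80-\xBF]{4} |
                      [\xFC-\xFD][\x80-\xBF]{5} |
                      [\x00-\x7f])*$/x;
    }

    # ISO-8859-1: bytes 0x80-0x9F (C1 controls) are treated as errors.
    sub is_valid_latin1
    {
        shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
    }

    # Windows-1252: no check yet, every byte sequence is accepted.
    sub is_valid_windows_1252 { 1 }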
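
The UTF-8 pattern above only looks at the lead/trail byte structure, so
it lets through things like overlong forms (E0 80 80) and encoded
surrogates (ED A0 80). If a stricter check is wanted, something along
these lines might do. This is only a sketch: the name is_strict_utf8 is
made up here, and the ranges assume the four-byte, U+10FFFF limit that
RFC 3629 specifies.

```perl
use strict;
use warnings;

# Hypothetical stricter check; expects a string of raw bytes.
sub is_strict_utf8
{
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, general case
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, no overlongs
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, general case
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    )*\z/x ? 1 : 0;
}
```

With this, is_strict_utf8("\xE0\x80\x80") is false, whereas the looser
pattern accepts the same bytes.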

I don't know how SP handles invalid input; maybe we can use it to
perform some of these tasks.

[1] HTML 4.01 doesn't say what to do if there is more than one element
    with the same http-equiv value
-- 
Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/

Received on Friday, 27 April 2001 21:41:23 UTC