Re: 8-bit chars in US-ASCII documents (was Re: Embarrassing typo!)

From: Bjoern Hoehrmann (derhoermi@gmx.net)
Date: Fri, Apr 27 2001


    From: Bjoern Hoehrmann <derhoermi@gmx.net>
    To: Terje Bless <link@tss.no>
    Cc: www-validator@w3.org
    Date: Sat, 28 Apr 2001 03:42:16 +0200
    Message-ID: <dv4ket89l1g6erqvlqp5iu909soqv4glte@4ax.com>
    Subject: Re: 8-bit chars in US-ASCII documents (was Re: Embarrassing typo!)
    
    * Terje Bless wrote:
    >>>>Btw. this is, as I'm sure you know, worse for HTML documents. XML
    >>>>documents can be encoded in UTF-8 or UTF-16 without declaring it,
    >>>>HTML can't, you must always declare the used encoding, since the user
    >>>>agent must not assume any default character encoding.
    >>>
    >>>IIRC, we still have that ISO-8859-1 default from the HTTP/1.1 spec, non?
    >>
    >>See HTML 4.01 section 5.2.2, 'Therefore, user agents must not assume any
    >>default value for the "charset" parameter'.
    >
    >How practical is it to put this into production? If the validator makes no
    >assumptions, will it make people fix their servers? Should this be
    >retroactively applied to earlier HTML versions? What says the W3C HTML
    >Recommendation overrules the IETF's HTTP Standard?
    
    Only HTML 4.0 and later make this restriction. We have a major conflict
    between HTTP/1.1 and HTML 4.0 here: HTTP/1.1 not only defines
    ISO-8859-1 as the default encoding, it even states in section 19.3
    that "not labeling the entity is preferred over labeling the entity
    with the labels US-ASCII or ISO-8859-1". RFC 2854, on the other hand,
    strongly recommends the use of an explicit charset parameter. Even
    worse, HTML 4 enables authors to use a meta element to set/override
    HTTP headers. I'm not sure whether a meta element overrides the HTTP
    header actually sent; HTML 4 only says in section 7.4.4, for the
    http-equiv attribute: "HTTP servers use this attribute to gather
    information for HTTP response message headers". I don't think any
    server developer ever took this seriously (I wouldn't either)... I
    think this is just horrible, and finding a correct _and_ usable
    solution is impossible.
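
    To make the two sources of a charset label concrete, here is a rough
    sketch of how each could be extracted. The helper names
    charset_from_content_type and charset_from_meta are hypothetical, and
    the meta lookup is a plain regex that only handles the attribute order
    http-equiv, content; a real implementation would want a proper parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type
# value such as "text/html; charset=ISO-8859-1". Returns undef if absent.
sub charset_from_content_type {
    my ($value) = @_;
    return undef unless defined $value;
    # The value may be a bare token or a quoted string; match case-insensitively.
    if ($value =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        my $charset = defined $1 ? $1 : $2;
        return lc $charset;
    }
    return undef;
}

# Hypothetical helper: read the content attribute of the first
# <meta http-equiv="Content-Type" content="..."> element and extract its
# charset parameter. Assumes http-equiv comes directly before content.
sub charset_from_meta {
    my ($html) = @_;
    if ($html =~ /<meta\s+http-equiv\s*=\s*["']?Content-Type["']?\s+content\s*=\s*(?:"([^"]*)"|'([^']*)')/i) {
        my $content = defined $1 ? $1 : $2;
        return charset_from_content_type($content);
    }
    return undef;
}
```

    Both helpers return a lowercased label or undef, so the two results
    can be compared directly.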
    
    I think the best thing we can (and should) do is
    
      * report a warning if there is no charset parameter in the HTTP
        response
      * report a warning if there is (in addition) no charset parameter in
        "the" [1] <meta http-equiv='Content-Type' content='...'> content
        type declaration
      * use ISO-8859-1 if none of them is given
      * report a warning if those two are given and don't match
      * report an error if the content doesn't match the declared encoding
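
    The checks listed above could be tied together roughly as follows.
    This is only a sketch: check_encoding and the %validators table are
    hypothetical names, the charset labels are assumed to be extracted
    already, and letting the HTTP header win over the meta element is an
    assumption, since that question is still open.

```perl
use strict;
use warnings;

# Hypothetical driver. $http_charset and $meta_charset are the extracted
# charset labels (undef if absent); $validators maps a lowercased label to
# a code ref returning true when the bytes are valid in that encoding.
sub check_encoding {
    my ($http_charset, $meta_charset, $bytes, $validators) = @_;
    my (@warnings, @errors);

    push @warnings, 'no charset parameter in the HTTP response'
        unless defined $http_charset;
    push @warnings, 'no charset parameter in the meta declaration either'
        unless defined $http_charset or defined $meta_charset;
    push @warnings, 'HTTP charset and meta charset do not match'
        if defined $http_charset and defined $meta_charset
           and lc($http_charset) ne lc($meta_charset);

    # Assumption: HTTP header first, then meta, then the ISO-8859-1 fallback.
    my $charset = lc(defined $http_charset ? $http_charset
                   : defined $meta_charset ? $meta_charset
                   : 'iso-8859-1');

    my $is_valid = $validators->{$charset};
    if (!$is_valid) {
        push @warnings, "no validator available for $charset";
    } elsif (!$is_valid->($bytes)) {
        push @errors, "content is not valid $charset";
    }
    return { charset => $charset, warnings => \@warnings, errors => \@errors };
}

# Example validator table with two simple byte-range checks.
my %validators = (
    'us-ascii'   => sub { $_[0] =~ /^[\x00-\x7f]*$/ },
    'iso-8859-1' => sub { $_[0] =~ /^[\x00-\x7f\xA0-\xFF]*$/ },
);

# An unlabeled document containing one Latin-1 byte.
my $report = check_encoding(undef, undef, "caf\xE9", \%validators);
print "charset: $report->{charset}\n";
print "warning: $_\n" for @{ $report->{warnings} };
print "error: $_\n"   for @{ $report->{errors} };
```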
    
    I can contribute code for the last item of that list (checking the
    content against the declared encoding):
    
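        # True if the input contains only 7-bit bytes.
        sub is_valid_us_ascii { shift =~ /^[\x00-\x7f]*$/ }

        # True if every lead byte is followed by the right number of
        # continuation bytes. Overlong forms of three or more bytes and
        # surrogates are not rejected.
        sub is_valid_utf8
        {
            shift =~ /^(?:[\xC2-\xDF][\x80-\xBF]{1} |
                          [\xE0-\xEF][\x80-\xBF]{2} |
                          [\xF0-\xF7][\x80-\xBF]{3} |
                          [\xF8-\xFB][\x80-\xBF]{4} |
                          [\xFC-\xFD][\x80-\xBF]{5} |
                          [\x00-\x7f])*$/x;
        }

        # True if no byte falls into the range 0x80-0x9F.
        sub is_valid_latin1
        {
            shift =~ /^[\x00-\x7f\xA0-\xFF]*$/
        }

        # Placeholder: accepts any input, no check is performed.
        sub is_valid_windows_1252 { 1 }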
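
    For comparison, a stricter UTF-8 check can be sketched by handing the
    bytes to a decoder and treating any failure as "not valid". This
    assumes a Perl that ships the Encode module; is_strict_utf8 is a
    hypothetical name, and this is a different technique from the regex
    above, not a drop-in equivalent.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stricter variant: decode with Encode's strict 'UTF-8'
# decoder and report failure if it croaks. Unlike a purely structural
# check, this also rejects overlong forms and encoded surrogates.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

print is_strict_utf8("caf\xC3\xA9") ? "valid\n" : "invalid\n";
print is_strict_utf8("caf\xE9")     ? "valid\n" : "invalid\n";
```

    The first call is given a correctly encoded e-acute, the second a
    bare Latin-1 byte, which is not valid UTF-8.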
    
    I don't know how SP handles invalid input; maybe we can use it to
    perform some of these tasks.
    
    [1] HTML 4.01 doesn't say what to do if there is more than one element
        with the same http-equiv value
    -- 
    Björn Höhrmann { mailto:bjoern@hoehrmann.de } http://www.bjoernsworld.de
    am Badedeich 7 } Telefon: +49(0)4667/981028 { http://bjoern.hoehrmann.de
    25899 Dagebüll { PGP Pub. KeyID: 0xA4357E78 } http://www.learn.to/quote/