flakey charset detection from David Brownell on 2002-12-04 (www-validator@w3.org from December 2002)

From: David Brownell <david-b@pacbell.net>
Date: Wed, 04 Dec 2002 08:01:12 -0800
To: www-validator@w3.org
Message-id: <3DEE26C8.60708@pacbell.net>

I recently tried to validate an XHTML 1.0 page that used to validate just
fine, and instead got a message that said things like:

    I was not able to extract a character encoding labeling from any of
    the valid sources for such information. Without encoding information
    it is impossible to validate the document. The sources I tried are:

      * The HTTP Content-Type field.
      * The XML Declaration.
      * The HTML "META" element.

    And I even tried to autodetect it using the algorithm defined in
    Appendix F of the XML 1.0 Recommendation.

This seems pretty bogus.  HTTP defaults to iso-8859-1, and the
validator can and should know that this encoding is the default.
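
To make the expected fallback concrete, here is a rough sketch of the
lookup order with the HTTP default as the last resort. It is an
illustration only, not the validator's actual code: the function names
are hypothetical, the META element is deliberately skipped, and the
iso-8859-1 default is the one RFC 2616 gives for text/* media types.

```python
import re

# Default charset RFC 2616 assigns to text/* media types with no charset parameter.
HTTP_DEFAULT_CHARSET = "iso-8859-1"


def charset_from_content_type(header):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not header:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)"?', header, re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_xml_declaration(head):
    """Return the encoding named in a leading XML declaration, or None.

    `head` is the first few hundred bytes of the document. This only handles
    ASCII-compatible encodings; it is not the full Appendix F autodetection.
    """
    match = re.match(
        rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head
    )
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type, body):
    """Try the HTTP header, then the XML declaration, then the HTTP default."""
    return (
        charset_from_content_type(content_type)
        or charset_from_xml_declaration(body[:200])
        or HTTP_DEFAULT_CHARSET
    )
```

With this ordering, a page served as plain "text/html" with no XML
declaration comes out as iso-8859-1 instead of producing an error.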

Or have people been playing with charset detection policies again?

- Dave

p.s. Given that it's XHTML, I find the fact that it even _tried_
      using the META element to be worrisome ... that means that
      parsing this document as XML could give different results
      than parsing it as HTML, which breaks every XHTML goal I ever
      heard of.  Not that I've tracked XHTML recently, but this
      seems like trouble.

Received on Wednesday, 4 December 2002 11:37:06 UTC