Re: default charset broken

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kjetil Torgrim Homme <kjetilho@ifi.uio.no> wrote:

>I maintain a small mostly static site.  the pages used to validate fine,
>but in your latest update of the validator you ignore the HTTP RFC and
>broke the charset detection.
>
>I hope you can fix this.  it would be sad to have to remove the link,
>since I think the validator has been a good incentive for following and
>for promoting standards.  but when the validator itself breaks the
>standard, there isn't much point ...
>
>(please CC me any responses, I haven't joined the list.)

...and later...

>>Could you please be more specific what is broken in what version
>>of the MarkUp Validator?
>
>oh, sorry.  the default charset for text/* is "ISO-8859-1", but the
>validator treats it as unknown.
>
>see 3.4.1 and 3.7.1 in RFC 2616.

Uhm, what a wonderfully confrontationally phrased bug report; you're usually
much more careful with your formulation in no.*, Kjetil! :-)

The relevant parts of the cited sections of RFC 2616 read in full:

[[[
  3.4.1 Missing Charset 

  Some HTTP/1.0 software has interpreted a Content-Type header without
  charset parameter incorrectly to mean "recipient should guess."
  Senders wishing to defeat this behavior MAY include a charset parameter
  even when the charset is ISO-8859-1 and SHOULD do so when it is known
  that it will not confuse the recipient.

  Unfortunately, some older HTTP/1.0 clients did not deal properly with
  an explicit charset parameter. HTTP/1.1 recipients MUST respect the
  charset label provided by the sender; and those user agents that have
  a provision to "guess" a charset MUST use the charset from the
  content-type field if they support that charset, rather than the
  recipient's preference, when initially displaying a document. See
  section 3.7.1.
]]] - RFC2616 3.4.1

[[[
  The "charset" parameter is used with some media types to define the
  character set (section 3.4) of the data. When no explicit charset
  parameter is provided by the sender, media subtypes of the "text"
  type are defined to have a default charset value of "ISO-8859-1" when
  received via HTTP. Data in character sets other than "ISO-8859-1" or
  its subsets MUST be labeled with an appropriate charset value. See
  section 3.4.1 for compatibility problems. 
]]] - RFC2616 3.7.1

Which appears to support your claim. Unfortunately, the HTML 4.01
Recommendation, Section 5.2.2, reads:

[[[
  5.2.2 Specifying the character encoding 
  [...]
  The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
  default character encoding when the "charset" parameter is absent from
  the "Content-Type" header field. In practice, this recommendation has
  proved useless because some servers don't allow a "charset" parameter
  to be sent, and others may not be configured to send the parameter.

  Therefore, user agents [MUST NOT] assume any default value for the
  "charset" parameter.
]]] - W3C HTML 4.01 Recommendation 5.2.2

Which puts us in a right pretty pickle.

We've been over this discussion ad nauseam on this list several times before.
The bottom line is that RFC 2616 and the HTML 4.01 Recommendation (and, by
extension, XHTML as well[0]) are incompatible on this point[1] and the _only_
safe way to achieve the correct character encoding for your documents is to
explicitly specify it in the HTTP «Content-Type» header.
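
To make the conflict concrete, here is a rough Python sketch of the two
rules applied to the same header. It is an illustration only and is not
Validator code; the function names and the "rfc2616"/"html401" rule labels
are made up for this example.

```python
from email.message import Message

# Default named by RFC 2616 3.7.1 for text/* received via HTTP.
RFC2616_DEFAULT = "ISO-8859-1"


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return charset if charset else None


def effective_charset(content_type, rule):
    """Apply one of the two conflicting rules (hypothetical labels).

    rule == "rfc2616": text/* without a charset defaults to ISO-8859-1.
    rule == "html401": no default is assumed (HTML 4.01, 5.2.2).
    """
    charset = declared_charset(content_type)
    if charset is not None:
        # Both specifications agree once the sender labels the document.
        return charset
    if rule == "rfc2616" and content_type.strip().lower().startswith("text/"):
        return RFC2616_DEFAULT
    return None


for header in ("text/html", "text/html; charset=ISO-8859-1"):
    print(header, "->",
          effective_charset(header, "rfc2616"),
          effective_charset(header, "html401"))
```

The two rules only disagree for the unlabelled header; as soon as the
charset parameter is present, both give the same answer.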

Now, as of the current released version (which is version 0.6.1, BTW), the
Validator will refuse to validate a document with no character encoding
specified in any of the nominally allowed places. In a slightly
controversial[2] change, the current beta release (0.6.2[3]) will proceed
with the validation, but will complain loudly and not label the document valid
until it finds a character encoding.

This new behaviour goes some way towards addressing your concern, but you will
still find your documents labelled Invalid unless you specify a character
encoding.


I would strongly encourage you to explicitly specify the character encoding.
In particular, I direct your attention to the part of RFC2616 3.4.1 which
reads: «Senders wishing to defeat this behavior MAY include a charset
parameter even when the charset is ISO-8859-1 ***and SHOULD do so when it is
known that it will not confuse the recipient.***» [emphasis added].

In this particular case, not only is it known that specifying the encoding
will not confuse the recipient; explicitly specifying it is the only way to
_avoid_ confusing «the recipient» (IOW, the «SHOULD» certainly kicks in).
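
How you set the header depends entirely on your server, which I don't know,
so take the following only as a sketch of the idea: a minimal Python WSGI
application that labels its output explicitly. The name "app" and the sample
body are placeholders for this example.

```python
def app(environ, start_response):
    """Serve one small page with an explicitly labelled character encoding."""
    # Encode the body in the same encoding the header announces.
    body = "<title>test</title><p>bl\u00e5b\u00e6r</p>".encode("iso-8859-1")
    headers = [
        # The charset parameter is the part that matters here.
        ("Content-Type", "text/html; charset=ISO-8859-1"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local test, the standard library's
wsgiref.simple_server.make_server("", 8000, app).serve_forever() will serve
it; the port number is arbitrary.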




[0] - With the added complication that XHTML superficially is meant to
      obey XML defaulting rules for character encoding (e.g. unlabelled
      usually means UTF-8).

[1] - And, yes, we have also been over whether it is appropriate for
      HTML to override HTTP. :-)

[2] - As in: I fought it tooth and nail but was finally persuaded that
      this behaviour was the best overall for the goal of persuading the
      web community to produce valid HTML. :-)

[3] - <http://validator.w3.org:8001/>. Feedback encouraged!


- -- 
As a cat owner, I know this for a fact... Nothing says "I love you" like a
decapitated gopher on your front porch.

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.2

iQA/AwUBPuH8e6PyPrIkdfXsEQIkdwCgwQ2wUz1/aUOsuDpS2hdSgZwA2AcAoNsD
ocn/dj0xA9N/GbOxxVX2+6KF
=KpnD
-----END PGP SIGNATURE-----

Received on Saturday, 7 June 2003 10:54:09 UTC