- From: Terje Bless <link@pobox.com>
- Date: Sat, 7 Jun 2003 16:53:49 +0200
- To: W3C Validator <www-validator@w3.org>
- cc: Kjetil Torgrim Homme <kjetilho@ifi.uio.no>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Kjetil Torgrim Homme <kjetilho@ifi.uio.no> wrote:

>I maintain a small mostly static site. the pages used to validate fine,
>but in your latest update of the validator you ignore the HTTP RFC and
>broke the charset detection.
>
>I hope you can fix this. it would be sad to have to remove the link,
>since I think the validator has been a good incentive for following and
>for promoting standards. but when the validator itself breaks the
>standard, there isn't much point ...
>
>(please CC me any responses, I haven't joined the list.)

...and later...

>>Could you please be more specific what is broken in what version
>>of the MarkUp Validator?
>
>oh, sorry. the default charset for text/* is "ISO-8859-1", but the
>validator treats it as unknown.
>
>see 3.4.1 and 3.7.1 in RFC 2616.

Uhm, what a wonderfully confrontationally phrased bug report; you're
usually much more careful with your formulation in no.* Kjetil! :-)

The relevant parts of the cited sections of RFC 2616 read in full:

[[[
3.4.1 Missing Charset

Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess."
Senders wishing to defeat this behavior MAY include a charset
parameter even when the charset is ISO-8859-1 and SHOULD do so when
it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with
an explicit charset parameter. HTTP/1.1 recipients MUST respect the
charset label provided by the sender; and those user agents that have
a provision to "guess" a charset MUST use the charset from the
content-type field if they support that charset, rather than the
recipient's preference, when initially displaying a document. See
section 3.7.1.
]]] - RFC2616 3.4.1

[[[
The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data.
When no explicit charset parameter is provided by the sender, media
subtypes of the "text" type are defined to have a default charset
value of "ISO-8859-1" when received via HTTP. Data in character sets
other than "ISO-8859-1" or its subsets MUST be labeled with an
appropriate charset value. See section 3.4.1 for compatibility
problems.
]]] - RFC2616 3.7.1

Which appears to support your claim. Unfortunately, the HTML 4.01
Recommendation, Section 5.2.2, reads:

[[[
5.2.2 Specifying the character encoding

[...]

The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
default character encoding when the "charset" parameter is absent from
the "Content-Type" header field. In practice, this recommendation has
proved useless because some servers don't allow a "charset" parameter
to be sent, and others may not be configured to send the parameter.
Therefore, user agents [MUST NOT] assume any default value for the
"charset" parameter.
]]] - W3C HTML 4.01 Recommendation 5.2.2

Which puts us in a right pretty pickle.

We've been over this discussion ad nauseam on this list several times
before. The bottom line is that RFC 2616 and the HTML 4.01
Recommendation (and, by extension, XHTML as well[0]) are incompatible
on this point[1] and the _only_ safe way to achieve the correct
character encoding for your documents is to explicitly specify it in
the HTTP «Content-Type» header.

Now, as of the current released version (which is version 0.6.1, BTW),
the Validator will refuse to Validate a document with no character
encoding specified in any of the nominally allowed places. In a
slightly controversial[2] change, the current beta release (0.6.2[3])
will proceed with the validation, but will complain loudly and not
label the document valid until it finds a character encoding.

This new behaviour goes some way towards addressing your concern, but
you will still find your documents labelled Invalid unless you specify
a character encoding.

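To make the disagreement concrete, here is a minimal Python sketch of
the two rules as applied to a bare «Content-Type» value. It is not the
Validator's actual code, and the function names are made up for
illustration. It also looks only at the HTTP header and ignores the
other places a document may declare its encoding.

```python
from email.message import Message


def _parse(content_type):
    """Wrap a raw Content-Type value so the stdlib can parse its parameters."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg


def charset_per_rfc2616(content_type):
    """RFC 2616 3.7.1: text/* with no charset parameter defaults to ISO-8859-1."""
    msg = _parse(content_type)
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset is None and msg.get_content_maintype() == "text":
        return "iso-8859-1"
    return charset


def charset_per_html401(content_type):
    """HTML 4.01 5.2.2: never assume a default; absent means unknown (None)."""
    return _parse(content_type).get_content_charset()


for header in ("text/html", "text/html; charset=UTF-8"):
    print(header, "->", charset_per_rfc2616(header), "/", charset_per_html401(header))
```

The two functions only disagree when the parameter is missing, which
is exactly the case an explicit charset in the header removes.
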
I would strongly encourage you to explicitly specify the character
encoding. In particular, I direct your attention to the part of
RFC2616 3.4.1 which reads: «Senders wishing to defeat this behavior
MAY include a charset parameter even when the charset is ISO-8859-1
***and SHOULD do so when it is known that it will not confuse the
recipient.***» [emphasis added].

In this particular case, not only is it known that specifying the
encoding will not confuse the recipient; explicitly specifying it is
the only way to _avoid_ confusing «the recipient» (IOW, the «SHOULD»
certainly kicks in).

[0] - With the added complication that XHTML superficially is meant to
      obey XML defaulting rules for character encoding (e.g.
      unlabelled usually means UTF-8).

[1] - And, yes, we have also been over whether it is appropriate for
      HTML to override HTTP. :-)

[2] - As in: I fought it tooth and nail but was finally persuaded that
      this behaviour was the best overall for the goal of persuading
      the web community to produce valid HTML. :-)

[3] - <http://validator.w3.org:8001/>. Feedback encouraged!

- -- 
As a cat owner, I know this for a fact...
Nothing says "I love you" like a decapitated gopher on your front porch.

-----BEGIN PGP SIGNATURE-----
Version: PGP SDK 3.0.2

iQA/AwUBPuH8e6PyPrIkdfXsEQIkdwCgwQ2wUz1/aUOsuDpS2hdSgZwA2AcAoNsD
ocn/dj0xA9N/GbOxxVX2+6KF
=KpnD
-----END PGP SIGNATURE-----
Received on Saturday, 7 June 2003 10:54:09 UTC