Re: default charset broken from Kjetil Torgrim Homme on 2003-06-08 (www-validator@w3.org from June 2003)

From: Kjetil Torgrim Homme <kjetilho@ifi.uio.no>
Date: Sun, 08 Jun 2003 10:45:37 +0200
To: W3C Validator <www-validator@w3.org>
Message-ID: <1rznkths9a.fsf@vingodur.ifi.uio.no>
[Terje Bless]:
>
>   No argument from me there. In fact I consider it a bug in HTML 4.0
>   that they meddle with what is IMO the province of HTTP, and a
>   bug in HTTP that they meddle with what is the province of MIME.

agreed.

>   > do you really think so?  I find that very hard to believe,
>   > especially since HTML 4 isn't even an IETF standard.
>   
>   And the W3C isn't, and doesn't claim to be, a recognized standards
>   body. But given this is the _W3C_ Markup Validator we kinda have
>   to accept its authority as given, non? :-)

:-)

>   But my point was that even if both documents were produced under
>   the aegis of the IETF, if HTML passed IETF Last Call with no
>   substantive complaints then it would have quite legally superseded
>   this proviso from HTTP.

ah.  yes, if it was a proposed standard.

>   If this was not acceptable to the IETF, the Area Director or the
>   RFC Editor should have addressed the issue prior to publication as
>   a standards track RFC.

exactly.

>   Case in point; RFC1036 (netnews) manages the neat trick of saying
>   a) that it borrows a majority of its syntax from RFC822 (email),
>   b) that where the two diverge RFC822 is to be considered
>   authoritative, _and_ c) goes merrily on its way superseding and
>   modifying both syntax and semantics of common header
>   fields. RFC1036 is still considered authoritative (albeit badly out
>   of touch with reality) within the IETF.

this is a bit off topic, but 1036 never was a proposed standard.  it
was also written a long time ago, in 1987, when IETF's procedures were
less stringent.  but I'm not sure I see the conflict, anyway.  an
RFC-1036 message MUST parse as an RFC-822 message, but not the other
way around.  for instance, the Message-ID header has more restrictive
syntax (no spaces allowed), but any RFC-1036 msg-id is allowed by
RFC-822.  similarly with References, where RFC-1036 only allows
msg-ids separated by a single space, but RFC-822 also allows atoms and
quoted-strings.

>   I agree; one of the two must yield. We have implemented a solution
>   based on RFC2616 yielding. Think OO; we import HTTP and override
>   its CharsetDefaulting method instead of throwing an
>   InvalidAccessException. :-)

that's a new subclass, so it is no longer HTTP... :-)

>   I'll grant that the issue is debatable though. Ours is but one of
>   (at least) two valid interpretations. And I'm not even certain
>   everyone involved in the validator is in perfect agreement on this
>   either. The status quo is probably best described as the rough
>   consensus somewhat biased by what I perceived the least-harmful /
>   overall-most-useful behaviour was, given the circumstances.

I propose this order

   explicit HTTP charset
   META HTTP-EQUIV
   charset attribute
   implicit HTTP (== ISO-8859-1)

that's it.  any guesswork should not influence parsing and status as
valid/invalid.  however, feel free to add big flashing warnings if the
file starts with 0xFE 0xFF (or 0xFF 0xFE, ugh), or if _all_ 8-bit
characters are part of valid UTF-8 encodings, etc.
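
a rough sketch of the order above, in Python.  this is not the
validator's code; the function and parameter names are made up for
illustration, and "charset attribute" is taken to mean whatever value
the caller extracted for it.  the heuristics only produce warnings and
never change the charset that was picked.

```python
from typing import List, Optional, Tuple


def pick_charset(http_charset: Optional[str],
                 meta_charset: Optional[str],
                 attr_charset: Optional[str]) -> Tuple[str, str]:
    """Return (charset, source) using the proposed precedence order."""
    candidates = [
        ("explicit HTTP charset", http_charset),
        ("META HTTP-EQUIV", meta_charset),
        ("charset attribute", attr_charset),
    ]
    for source, value in candidates:
        if value:
            return value.strip().lower(), source
    # Nothing was declared anywhere: fall back to the implicit HTTP default.
    return "iso-8859-1", "implicit HTTP default"


def encoding_warnings(data: bytes, charset: str) -> List[str]:
    """Advisory checks only; the result must not affect parsing or validity."""
    warnings = []
    # 0xFE 0xFF or 0xFF 0xFE at the start looks like a UTF-16 byte order mark.
    if data[:2] in (b"\xfe\xff", b"\xff\xfe") and not charset.startswith("utf-16"):
        warnings.append("starts with a UTF-16 byte order mark")
    # If there are 8-bit bytes and the whole input decodes as strict UTF-8,
    # every 8-bit byte is part of a valid UTF-8 sequence.
    if charset != "utf-8" and any(b >= 0x80 for b in data):
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            pass
        else:
            warnings.append("all 8-bit bytes form valid UTF-8 sequences")
    return warnings
```

for a document with no HTTP charset and a META declaring UTF-8,
pick_charset(None, "UTF-8", None) picks the META value; with nothing
declared at all it falls through to ISO-8859-1, and encoding_warnings
can then flag content that looks like UTF-8 or UTF-16 without
overriding that choice.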

-- 
Kjetil T.
Received on Sunday, 8 June 2003 04:45:41 UTC