W3C home > Mailing lists > Public > www-validator@w3.org > December 2002

Re: Autodetection failure

From: Martin Duerst <duerst@w3.org>
Date: Wed, 11 Dec 2002 06:36:28 +0900
Message-Id: <4.2.0.58.J.20021211063236.04b902a8@localhost>
To: Terje Bless <link@pobox.com>, W3C Validator <www-validator@w3.org>
Cc: Elliotte Rusty Harold <elharo@metalab.unc.edu>

At 07:54 02/12/09 +0100, Terje Bless wrote:
>Elliotte Rusty Harold <elharo@metalab.unc.edu> wrote:

> >I believe that in this case for XHTML, the fallback should be UTF-8. It
> >certainly is for XML, and I don't think there's any reason XHTML should
> >be different. If everything else fails, assume UTF-8.
>
>Hmmm. I must admit I'm somewhat fuzzy on the details here, but IIRC, for
>XML to be transported without explicit encoding information it must contain
>an XML Declaration. At least, the autodetect algorithm in Appendix F of the
>XML 1.0 Recommendation relies on there being either an XML Declaration or a
>Unicode Byte-Order Mark in the absence of encoding information from a
>higher-level protocol (i.e. HTTP).

That's not exactly correct. Appendix F lists 'everything else' as
being UTF-8. But please note that the absence of a 'charset' parameter
on a Content-Type header and the absence of charset information are
not exactly the same.
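
For illustration (not part of the original exchange), a rough Python sketch
of the Appendix F detection order: check for a Byte-Order Mark first, then
the byte pattern of a leading '<?xml', and fall back to UTF-8 for
'everything else'. The function name is my own; the byte patterns follow
the appendix.

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding family from its first bytes,
    roughly following Appendix F of the XML 1.0 Recommendation:
    BOM first, then the byte pattern of '<?xml', else UTF-8."""
    # BOMs, longest first so UTF-32 is not mistaken for UTF-16.
    boms = [
        (b"\x00\x00\xfe\xff", "UTF-32BE"),
        (b"\xff\xfe\x00\x00", "UTF-32LE"),
        (b"\xfe\xff", "UTF-16BE"),
        (b"\xff\xfe", "UTF-16LE"),
        (b"\xef\xbb\xbf", "UTF-8"),
    ]
    for bom, enc in boms:
        if data.startswith(bom):
            return enc
    # No BOM: the bytes of '<?' reveal UTF-16 without a BOM.
    if data.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16BE"
    if data.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16LE"
    # 'Everything else' falls back to UTF-8 (plain ASCII '<?xml' included).
    return "UTF-8"
```

Note that this only narrows down the encoding family; the actual charset
still has to come from the XML Declaration or a higher-level protocol.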

In practical terms, assuming UTF-8 has the advantage that there is
a high chance that a mistake (i.e. something actually not UTF-8)
will be caught; for many other encodings, that chance is much lower.
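
A small Python illustration of that point (my own example, not from the
original message): bytes from a legacy single-byte encoding usually fail
UTF-8 validation outright, whereas a single-byte decoder silently accepts
any byte sequence and produces mojibake instead of an error.

```python
# 'café' in Latin-1: the é is the lone byte 0xE9, which is an
# invalid start of a multi-byte sequence in UTF-8.
latin1_bytes = "café".encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    caught = False
except UnicodeDecodeError:
    caught = True          # the wrong UTF-8 assumption is detected

# The reverse mistake goes unnoticed: Latin-1 decodes any bytes.
mojibake = "café".encode("utf-8").decode("latin-1")  # 'cafÃ©'
```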


>Which Content-Type the document was served/uploaded as will also affect the
>character encoding determination as the different types have different
>defaults and defaulting behaviour in this regard.

Yes indeed.


Regards,   Martin.
Received on Tuesday, 10 December 2002 16:37:09 GMT
