Re: charset parameter from Martin Duerst on 2001-07-26 (www-validator@w3.org from July 2001)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 26 Jul 2001 11:22:18 +0900
To: Lloyd Wood <L.Wood@eim.surrey.ac.uk>, Terje Bless <link@pobox.com>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, www-validator@w3.org
Message-Id: <4.2.0.58.J.20010726111922.05b632f0@sh.w3.mag.keio.ac.jp>

At 14:03 01/07/25 +0100, Lloyd Wood wrote:
>On Wed, 25 Jul 2001, Terje Bless wrote:
>
> > The issue is that the transport protocol says that the absence of an explicit
> > charset parameter on the Content-Type means "ISO-8859-1"; HTML or XML rules
> > don't apply here. When it comes time to parse the markup, you already have
> > a charset; the XML/HTML rules do not govern HTTP.
>
>well, that's handy.

But as I wrote, it's not correct.

>I've always wondered how you define the charset for the line that
>defines the charset so that you can interpret it.

The HTTP headers are defined to be in ASCII. For the 'in-document'
information, you either assume ASCII (for HTML) or apply more
complicated heuristics (for XML; see Appendix F of the XML 1.0
specification). The validator currently assumes ASCII (or anything
compatible with it).
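
To make the 'in-document' step concrete, here is a rough Python sketch.
It is not the validator's actual code: the function name
sniff_in_document, the 1024-byte window, and the regular expressions are
assumptions for illustration, and it covers only a few of the byte
patterns listed in XML Appendix F (no UCS-4, no EBCDIC).

```python
import re
from typing import Optional

# A few byte-order marks; XML Appendix F lists more cases than this.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# Both patterns assume the bytes are ASCII-compatible at this point.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
META = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z][A-Za-z0-9._:-]*)',
    re.IGNORECASE,
)


def sniff_in_document(data: bytes) -> Optional[str]:
    """Guess the declared encoding from the first bytes, or return None."""
    head = data[:1024]  # hypothetical limit, chosen for illustration

    # 1. A byte-order mark settles the question.
    for bom, name in BOMS:
        if head.startswith(bom):
            return name

    # 2. No BOM, but "<?" visibly encoded as two-byte units.
    if head.startswith(b"\x00<\x00?"):
        return "UTF-16BE"
    if head.startswith(b"<\x00?\x00"):
        return "UTF-16LE"

    # 3. Otherwise assume ASCII compatibility and read the declaration:
    #    the XML declaration first, then an HTML <meta> element.
    match = XML_DECL.match(head) or META.search(head)
    return match.group(1).decode("ascii") if match else None
```

Step 3 is where the ASCII assumption does the work: the declaration can
be read before the real encoding is known only because its own
characters are encoded the same way in every ASCII-compatible charset.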

> > In practice you have to decide between "Assume ISO-8859-1 as that's what
> > /people/ tend to assume" or "Assume nothing as people will get it wrong
> > some part of the time".
>
>I don't see how you can ever assume nothing.

Well, for the validator, 'assume nothing' just means 'document
doesn't validate'. That's quite easy :-).
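
As a sketch of that policy in Python (again not the validator's code;
the names charset_from_content_type and check are hypothetical, and
letting the HTTP charset parameter win over the in-document declaration
is an assumption of this sketch):

```python
import re
from typing import Optional, Tuple

# Matches the charset parameter of a Content-Type value, quoted or not.
CHARSET_PARAM = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Return the explicit charset parameter, or None if there is none."""
    if not header:
        return None
    match = CHARSET_PARAM.search(header)
    return match.group(1) if match else None


def check(content_type: Optional[str],
          in_document: Optional[str]) -> Tuple[bool, str]:
    """Apply 'assume nothing': no declared charset means no validation."""
    charset = charset_from_content_type(content_type) or in_document
    if charset is None:
        # No ISO-8859-1 fallback is applied here; the document is rejected.
        return (False, "no character encoding declared; cannot validate")
    return (True, charset)
```

For example, check("text/html", None) reports a failure, while
check("text/html", "EUC-JP") goes ahead with EUC-JP.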

Regards,   Martin.

Received on Wednesday, 25 July 2001 22:23:27 UTC