- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 1 May 2008 23:12:07 +0300
- To: "W3C Validator Community" <www-validator@w3.org>
Nikita The Spider The Spider wrote:
> On Sun, Apr 27, 2008 at 9:43 PM, olivier Thereaux <ot@w3.org> wrote:
>> http://www.ietf.org/rfc/rfc2854.txt
>> " Section 3.7.1, defines that "media subtypes of the 'text' type are
>> defined to have a default charset value of 'ISO-8859-1'"."
>> (ditto RFC 2616)
>>
>> This is the inconsistency at the core of the issue, isn't it.
>
> I agree, and I'm surprised that this topic hasn't received more
> attention in this debate, because it seems like a source for a
> definitive (if unpopular) choice for a default encoding when one isn't
> specified.
I think we are reading too much into the HTTP protocol (RFC 2616) if we
take it as setting a default value for subtypes of 'text' that cannot
be overridden in subtype definitions. Such a rigid rule would be
pointless and impractical. Instead, I think the idea is simply to set a
default for 'text' subtypes for definiteness. That is, the idea is to
ensure that the encoding is always defined even if there is no explicit
charset parameter and the media type does not set its own default.
Admittedly the HTML decision to leave the encoding explicitly undefined
more or less violates this idea, but this is intentional and should be
honored, IMHO.
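To make that defaulting chain concrete, here is a rough Python sketch
of how I read the specs (the function and its defaulting order are my
own illustration, not anybody's actual code):

  def effective_charset(content_type, subtype_default=None):
      # 1. An explicit charset parameter always wins.
      for param in content_type.split(';')[1:]:
          name, _, value = param.strip().partition('=')
          if name.lower() == 'charset':
              return value.strip('"').lower()
      # 2. A default set by the media type's own definition
      #    (HTML deliberately defines none).
      if subtype_default is not None:
          return subtype_default
      # 3. The HTTP fallback for text/* (RFC 2616): ISO-8859-1.
      if content_type.split(';')[0].strip().lower().startswith('text/'):
          return 'iso-8859-1'
      return None

  effective_charset('text/html; charset=UTF-8')  # -> 'utf-8'
  effective_charset('text/html')                 # -> 'iso-8859-1'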
RFC 2854 is informational only and really a mess: an ad hoc document
souped up to deal with the transition of all HTML-related specs from
the IETF to the W3C. The text that Olivier quoted deals with the
conflict between the MIME and HTTP specs, and a merely informational
RFC cannot solve the problem. But the conflict is irrelevant to HTML
issues if we take it that all specifications allow specific media type
definitions to set their own defaults.
However, RFC 2854 is correct in the following observation:
"Using an explicit charset parameter also takes into account that the
overwhelming majority of deployed browsers are set to use something
else than 'ISO-8859-1' as the default; the actual default is either a
corporate character encoding or character encodings widely deployed
in a certain national or regional community."
This is a good reason not to assume ISO-8859-1 in a validator, because
it leads to pointless error messages about data characters.
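To see why, consider a Python snippet (my example, not validator code).
The bytes that Windows editors produce for curly quotes fall in the
0x80-0x9F range, which ISO-8859-1 maps to C1 control characters, i.e.
characters that are not allowed as data in HTML:

  data = b'\x93quoted\x94'     # curly quotes as typed on Windows (cp1252)
  data.decode('windows-1252')  # '\u201cquoted\u201d' - what the author meant
  data.decode('iso-8859-1')    # '\x93quoted\x94' - C1 controls, so the
                               # validator complains about data characters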
But if I have understood Olivier's comments correctly, the problem is
that the document data needs to be transcoded into UTF-8, and here you
cannot just leave bytes undefined or assume that they denote graphic
characters by some unknown rules. I still think the _best_ approach, for
a document with unspecified encoding, would be to generate the response
page so its encoding is unspecified too, with document data characters
copied as such. But _this_ would probably need some reprogramming.
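In Python terms, the crux is that re-encoding forces a decision for
every input byte (a sketch of the dilemma, not of the validator's
actual code):

  raw = open('page.html', 'rb').read()   # bytes, encoding unspecified
  # To emit UTF-8, every byte must be mapped to *some* character;
  # there is no way to pass an undefined byte through untouched:
  text = raw.decode('utf-8', errors='replace')  # a guess; stray bytes -> U+FFFD
  response_body = text.encode('utf-8')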
This leads to the question of whether a validator should just say "No".
That is, when the document's encoding has not been specified, it should
simply say so and instruct the user in how to specify it, with suitable
references. This means abandoning the idea of helpful checking based on
some guessed encoding. The reason is that the results are often not
helpful but confusing and misleading.
We _know_ that the user should do something about the encoding, and he
_can_ do it (at least using a <meta> tag), so why should we help him to
postpone this?
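In code terms, the check could be as simple as this rough Python sketch
(the regular expression is my own shortcut, nothing like a real
parser):

  import re

  def encoding_specified(content_type, body):
      if 'charset=' in content_type.lower():
          return True                      # declared in the HTTP header
      if body.startswith(b'\xef\xbb\xbf'):
          return True                      # UTF-8 byte order mark
      # Look for a <meta> charset declaration near the top of the file.
      return re.search(rb'<meta[^>]*charset', body[:1024], re.I) is not None

When this returns False, the validator would stop and show the user the
fix, e.g.

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">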
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/