Re: Fallback to UTF-8

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Thu, 1 May 2008 23:12:07 +0300
Message-ID: <039d01c8abc7$a23bd190$0500000a@DOCENDO>
To: "W3C Validator Community" <www-validator@w3.org>

Nikita The Spider The Spider wrote:

> On Sun, Apr 27, 2008 at 9:43 PM, olivier Thereaux <ot@w3.org> wrote:
>> http://www.ietf.org/rfc/rfc2854.txt
>> " Section 3.7.1, defines that "media subtypes of the 'text' type are
>> defined to have a default charset value of 'ISO-8859-1'"."
>> (ditto RFC 2616)
>>
>> This is the inconsistency at the core of the issue, isn't it.
>
> I agree, and I'm surprised that this topic hasn't received more
> attention in this debate, because it seems like a source for a
> definitive (if unpopular) choice for a default encoding when one isn't
> specified.

I think we are reading too much into the HTTP protocol (RFC 2616) if we 
think that it is intended to set a default value for subtypes of 'text' 
so that these values cannot be overridden in subtype definitions. It 
would be pointless and impractical. Instead, I think the idea is simply 
to set a default for 'text' subtypes for definiteness. That is, the idea 
is to ensure that the encoding is always defined even if there is no 
explicit charset parameter and the media type does not set its own 
default. Admittedly the HTML decision to leave the encoding explicitly 
undefined more or less violates this idea, but this is intentional and 
should be honored, IMHO.
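The precedence the paragraph describes can be sketched as a small function (an illustration of the reading argued for above, not any validator's actual code): an explicit charset parameter wins, then the subtype's own default, and only then HTTP's blanket ISO-8859-1.

```python
def resolve_charset(content_type, subtype_default=None):
    """Effective charset for a text/* Content-Type value (sketch)."""
    # 1. An explicit charset parameter always wins.
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip('"').lower()
    # 2. Otherwise the media subtype may define its own default.
    if subtype_default:
        return subtype_default
    # 3. Only as a last resort does HTTP's text/* default apply.
    return "iso-8859-1"

print(resolve_charset("text/html; charset=UTF-8"))  # utf-8
print(resolve_charset("text/html"))                 # iso-8859-1
```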

RFC 2854 is informational only and really a mess: an ad hoc document souped 
up to deal with the transition of all HTML-related specs from the IETF to 
the W3C. The text that Olivier quoted deals with the conflict between the 
MIME and HTTP specs, and a merely informational RFC cannot solve the problem. 
But the conflict is irrelevant to HTML issues if we think that all 
specifications allow specific media type definitions to set their own 
defaults.

However, RFC 2854 is correct in the following observation:

   "Using an explicit charset parameter also takes into account that the
   overwhelming majority of deployed browsers are set to use something
   else than 'ISO-8859-1' as the default; the actual default is either a
   corporate character encoding or character encodings widely deployed
   in a certain national or regional community."

This is a good reason not to assume ISO-8859-1 in a validator, because 
it leads to pointless error messages about data characters.

But if I have understood Olivier's comments correctly, the problem is 
that the document data needs to be transcoded into UTF-8, and here you 
cannot just leave bytes undefined or assume that they denote graphic 
characters by some unknown rules. I still think the _best_ approach, for 
a document with unspecified encoding, would be to generate the response 
page so its encoding is unspecified too, with document data characters 
copied as such. But _this_ would probably need some reprogramming.
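The transcoding problem can be made concrete (a hedged illustration only, not the validator's code): once the output must be UTF-8, every input byte needs a definite interpretation. With an unspecified source encoding, only the ASCII range is safe to pass through; anything else can merely be marked, not decoded.

```python
def transcode_unknown(data: bytes) -> str:
    """Transcode bytes of unknown encoding to text, conservatively."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))      # ASCII: identical in common encodings
        else:
            out.append("\ufffd")    # undecidable byte -> replacement char
    return "".join(out)

# 0xE9 might be e-acute (ISO-8859-1) or something else entirely; without
# a declared encoding, the only honest rendering is the replacement char.
print(transcode_unknown(b"caf\xe9"))
```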

This leads to the question whether a validator should just say "No". 
That is, when the document's encoding has not been specified, it should 
simply say that and instruct the user how to specify it, with suitable 
references. This means abandoning the idea of helpful checking using 
some guess. The reason is that the results are often not helpful but 
confusing and misleading.

We _know_ that the user should do something about the encoding, and he 
_can_ do it (at least using a <meta> tag), so why should we help him to 
postpone this?
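The <meta> declaration referred to above is, in HTML 4.01 style,
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">. A rough sketch (assumed logic, not the W3C validator's actual implementation) of how a checker might spot such a declaration in the opening bytes of a document:

```python
import re

# Matches charset=... inside a meta declaration; deliberately loose.
META_RE = re.compile(rb'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def sniff_meta_charset(head: bytes):
    """Return a charset declared in the document head, or None."""
    match = META_RE.search(head[:1024])   # look only at the opening bytes
    return match.group(1).decode("ascii").lower() if match else None

html = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(sniff_meta_charset(html))   # utf-8
```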


Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/ 
Received on Thursday, 1 May 2008 20:12:48 GMT
