- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 1 May 2008 23:12:07 +0300
- To: "W3C Validator Community" <www-validator@w3.org>
Nikita The Spider The Spider wrote:
> On Sun, Apr 27, 2008 at 9:43 PM, olivier Thereaux <ot@w3.org> wrote:
>> http://www.ietf.org/rfc/rfc2854.txt
>> "Section 3.7.1, defines that "media subtypes of the 'text' type are
>> defined to have a default charset value of 'ISO-8859-1'"."
>> (ditto RFC 2616)
>>
>> This is the inconsistency at the core of the issue, isn't it.
>
> I agree, and I'm surprised that this topic hasn't received more
> attention in this debate, because it seems like a source for a
> definitive (if unpopular) choice for a default encoding when one
> isn't specified.

I think we are reading too much into the HTTP protocol (RFC 2616) if we think that it is intended to set a default value for subtypes of 'text' in such a way that the default cannot be overridden in subtype definitions. That would be pointless and impractical. Instead, I think the idea is simply to set a default for 'text' subtypes for definiteness: to ensure that the encoding is always defined, even when there is no explicit charset parameter and the media type does not set its own default.

Admittedly, the HTML decision to leave the encoding explicitly undefined more or less violates this idea, but it is intentional and should be honored, IMHO.

RFC 2854 is informational only and really a mess, an ad hoc document put together to deal with the transition of all HTML-related specs from the IETF to the W3C. The text that Olivier quoted deals with the conflict between the MIME and HTTP specs, and a merely informational RFC cannot solve that problem. But the conflict is irrelevant to HTML issues if we take the view that all the specifications allow specific media type definitions to set their own defaults.

However, RFC 2854 is correct in the following observation:

"Using an explicit charset parameter also takes into account that the overwhelming majority of deployed browsers are set to use something else than 'ISO-8859-1' as the default; the actual default is either a corporate character encoding or character encodings widely deployed in a certain national or regional community."

This is a good reason not to assume ISO-8859-1 in a validator: the assumption leads to pointless error messages about the data characters in the document.

But if I have understood Olivier's comments correctly, the problem is that the document data needs to be transcoded into UTF-8, and there you cannot just leave bytes undefined or assume that they denote graphic characters by some unknown rules. (A small illustration appears at the end of this message.)

I still think the _best_ approach, for a document with unspecified encoding, would be to generate the response page so that its encoding is unspecified too, with the document's data characters copied as such. But _this_ would probably need some reprogramming.

This leads to the question of whether a validator should just say "No". That is, when the document's encoding has not been specified, it should simply say so and instruct the user on how to specify it, with suitable references. This means abandoning the idea of helpful checking based on some guess, for the reason that the results are often not helpful but confusing and misleading. We _know_ that the user should do something about the encoding, and he _can_ do it (at least by using a <meta> tag; see the example at the end of this message), so why should we help him postpone it?

Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
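
P.S. A minimal sketch of the transcoding problem (in Python; the byte 0xA4 is picked arbitrarily for illustration):

  # The single byte 0xA4 is CURRENCY SIGN in ISO-8859-1 but EURO SIGN
  # in ISO-8859-15. Without a declared charset, a transcoder can only
  # guess which character to emit in its UTF-8 output.
  data = b"\xa4"
  print(data.decode("iso-8859-1"))   # U+00A4 CURRENCY SIGN
  print(data.decode("iso-8859-15"))  # U+20AC EURO SIGN

The byte sequence is perfectly valid in both encodings, so no inspection of the bytes themselves can settle the question.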
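
P.P.S. For reference, declaring the encoding explicitly takes one short line (utf-8 here is just an example; any registered charset name works). In the HTTP header:

  Content-Type: text/html; charset=utf-8

or, inside the document, using the HTML 4.01 <meta> form:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">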
Received on Thursday, 1 May 2008 20:12:48 UTC