- From: olivier Thereaux <ot@w3.org>
- Date: Mon, 28 Apr 2008 15:42:45 +0900
- To: Jukka K.Korpela <jkorpela@cs.tut.fi>
- Cc: "W3C Validator Community" <www-validator@w3.org>
Hi Jukka,

Thanks for your feedback.

On 28-Apr-08, at 3:05 PM, Jukka K. Korpela wrote:
>> * utf-8, because it is the future-looking encoding, also appropriate
>> for most international content.
>
> Future-looking does not apply here. We are dealing with an _error
> condition_: the encoding has not been specified, and you (we) have
> decided that the validator should make a guess, as extra comfort to
> the user.

More and more authoring tools are and (my bet on the future) will be using utf-8 as a default. Considering it a worthy guess in the case of non-labeled content is thus future-looking.

> In the given situation, the validator should encourage the user to
> specify the encoding

We agree. And it does.

[[
No Character Encoding Found!
...
Read the FAQ entry on character encoding for more details and pointers on how to fix this problem with your document.
]]

This message will be shown regardless of whether our attempts to transcode with one, or several, defaults, succeed.

> Why those three?

Because these are the three encodings that are either declared as a potential default in relevant specifications, or (for utf-8) are a rising default and thus a good encoding to try. In other words, because we can't try all the encodings possible.

> Why would you take extra trouble
> to make such a guess and ultimately reject a document just because
> it is in one of the many 8-bit encodings that happens to have
> characters in positions that make them malformed in any of the
> encodings you tried?
[...]
> Exactly what is wrong with the idea of assuming octets 0...127 decimal
> to have their Ascii meanings and other octets to constitute data
> characters in some unknown encoding? Not knowing what those data
> characters are does not harm validation at all.

That was discussed earlier. In a nutshell: error display, source display, outline etc. We need to transcode into the output charset of the validator, namely utf-8, or forfeit those features.
The latter does not appear to be an option at this point, hence the compromise.

-- 
olivier
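
For illustration only, the fallback described in this message (try a short list of default encodings, then transcode to utf-8 for output) could be sketched roughly as follows. This is not the validator's actual code. The function name and the candidate list are assumptions: the message names only utf-8 explicitly, so the other two entries are placeholders.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list. Only utf-8 is named in the message;
# the other two entries are placeholders chosen for illustration.
CANDIDATES: Sequence[str] = ("utf-8", "windows-1252", "iso-8859-1")


def transcode_with_fallbacks(
    raw: bytes, candidates: Sequence[str] = CANDIDATES
) -> Optional[Tuple[str, bytes]]:
    """Try each candidate encoding in order for an unlabeled document.

    Returns (encoding_used, utf8_bytes) for the first candidate that
    decodes without error, or None if every candidate fails.
    """
    for name in candidates:
        try:
            # Strict decoding: any malformed byte rejects this candidate.
            text = raw.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue
        # Transcode into the output charset (utf-8).
        return name, text.encode("utf-8")
    return None


if __name__ == "__main__":
    # 0xE9 alone is malformed as utf-8, so the next candidate is tried.
    print(transcode_with_fallbacks(b"caf\xe9"))
```

In this sketch a successful guess only makes it possible to produce output. The caller would still show the "No Character Encoding Found!" message either way, as stated above.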
Received on Monday, 28 April 2008 06:43:21 UTC