
Re: Validation of apostrophe

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Wed, 14 Feb 2007 23:14:15 +0200 (EET)
To: www-validator@w3.org
cc: richard Eskins <R.Eskins@mmu.ac.uk>
Message-ID: <Pine.GSO.4.64.0702142305420.28844@mustatilhi.cs.tut.fi>

On Thu, 15 Feb 2007, olivier Thereaux wrote:

>> Then the validator somehow fails to pay attention to the ISO-8859-1 
>> encoding declared in the <meta> tag.
>
> I don't follow your logic here. As you mention, the browser is performing the 
> transcoding,

It's an undocumented feature, as far as I can see, but since it's 
basically something that the browser does, that's understandable. On the 
other hand, UTF-8 is still somewhat unexpected (well, less common) on 
web pages, so maybe the direct input facility should carry a comment 
about this.

> so the data is utf-8, and the validator treats it as utf-8, 
> correctly. Because of the transcoding, the charset information in the meta 
> tag becomes completely irrelevant: the validator should not "pay attention" 
> to it.

This is a somewhat perplexing issue, but your conclusion seems 
surprising. The meta tag _is_ there, and by HTML specifications, it is to 
be trusted when there is no HTTP header to the contrary. Can a document 
declared that way be ISO-8859-1 encoded and yet contain characters that 
have no representation in that encoding - as raw character data?
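
To make the question concrete, here is a minimal sketch in Python that 
lists the characters of a document which cannot be encoded in the charset 
it declares. The function name unrepresentable and the sample string are 
hypothetical, and the right single quotation mark (U+2019) is only an 
assumed example of an apostrophe-like character outside ISO-8859-1; this 
is not a description of what the validator does.

```python
def unrepresentable(text, declared="iso-8859-1"):
    """Return the distinct characters of text that the declared charset cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(declared)
        except UnicodeEncodeError:
            # The declared encoding has no byte sequence for this character.
            if ch not in bad:
                bad.append(ch)
    return bad


# Hypothetical sample: a "curly" apostrophe typed as raw character data.
sample = "It\u2019s a caf\u00e9"
print(unrepresentable(sample))
```

Here the accented letter passes, since it exists in ISO-8859-1, while the 
curly apostrophe is reported.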

In practical terms, this results in confusion: you copy and paste your 
document into the direct input field and get a response saying that the 
document is valid, though it is not. We might distinguish between the 
document as it resides on disk and the document as submitted via the 
form, but I'm afraid less experienced web page authors will get 
completely lost, and virtually all will be surprised if they notice this 
situation.

Shouldn't the validator at least issue a warning saying that it received 
the document in utf-8 encoding, even though the document declares a 
different character encoding? Admittedly this would be confusing too, even 
to people who have no problems with encodings since they never dreamt of 
anything outside ASCII. :-)
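
As a rough illustration of the kind of warning I mean, the following 
Python sketch compares the encoding in which a document was received with 
the charset named in its meta tag. Everything here is an assumption made 
for illustration: the function encoding_warning, the simplistic regular 
expression (a real checker would parse the markup properly), and the 
wording of the message. It says nothing about how the validator is 
actually implemented.

```python
import codecs
import re

# Simplistic pattern: finds charset=... inside the first <meta ...> tag that has one.
META_CHARSET = re.compile(
    r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def _canonical(name):
    """Normalize an encoding label so that aliases compare equal."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def encoding_warning(document, received_as="utf-8"):
    """Return a warning string if the meta charset differs from the received encoding."""
    match = META_CHARSET.search(document)
    if match is None:
        return None  # No meta declaration, nothing to compare.
    declared = match.group(1)
    if _canonical(declared) == _canonical(received_as):
        return None
    return (
        f"Document was received as {received_as}, "
        f"but its meta tag declares {declared}."
    )


page = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(encoding_warning(page))
```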

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
Received on Wednesday, 14 February 2007 21:14:27 GMT
