Re: flakey charset detection from David Brownell on 2002-12-04 (www-validator@w3.org from December 2002)

From: David Brownell <david-b@pacbell.net>
Date: Wed, 04 Dec 2002 12:15:28 -0800
To: Karl Dubost <karl@w3.org>
Cc: www-validator@w3.org
Message-id: <3DEE6260.5060500@pacbell.net>

Karl Dubost wrote:
> At 8:01 -0800 2002-12-04, David Brownell wrote:
> 
>> I recently validated a xhtml 1.0 page that used to validate just fine,
>> and instead, I got a message that said things like:
> 
> 
> Could you give an URI of your document?

http://xmlconf.sourceforge.net/xml/

... you'll notice it's served as "Content-Type: text/html" with no
"charset=..." parameter, which is specified (see
http://www.ietf.org/rfc/rfc2854.txt section 6) to mean "iso-8859-1".


>> p.s. Given that it's XHTML, I find the fact that it even _tried_
>>      using the META element to be worrisome ... that means that
>>      parsing this document as XML could give different results,
>>      which breaks all XHTML goals I ever heard.  Not that I've
>>      tracked XHTML recently, but this seems like trouble.
> 
> 
> I put an XHTML 1.0 document encoded as UTF-8
> http://www.w3.org/QA/2002/12/xhtml-utf-8.html
> 
> without Meta or XML Declaration, because XHTML 1.0 is an XML document, 
> so XML document encoded as UTF-8 doesn't need the encoding information.

Wrong -- it's getting delivered in "iso-8859-1" because that web
server has been configured to make the HTTP default explicit.  (Maybe
the defaults have been changed?)

That may be why it validates at all for you, given that the validator
doesn't seem to understand what HTTP means.

> The only problem I see is that the validator does the right job and
> respect the HTTP header information
> 
> HEAD http://www.w3.org/QA/2002/12/xhtml-utf-8.html
> 200 OK
> Date: Wed, 04 Dec 2002 16:55:58 GMT
> Content-Type: text/html; charset=iso-8859-1

Not a bug.  That's what it's supposed to do.  All HTTP clients are
required to handle that "charset=..." in that way.


> BUT It validates with the wrong encoding. So the validator doesn't check 
> if the document is sent with the right encoding. But I guess in some 
> cases it's a bit tricky to detect.

It's a perfectly valid iso-8859-1 document, but the accented characters
display like garbage in Mozilla ... since utf-8 uses two bytes for them,
while iso-8859-1 uses one byte per character.
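
To illustrate the byte-count mismatch, here is a minimal Python sketch. The
sample character is an assumption chosen for illustration; it is not taken
from the page under discussion.

```python
# Hypothetical sample: one accented character that exists in iso-8859-1.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")         # two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")  # one byte:  E9

# What a client gets when it honors "charset=iso-8859-1" on UTF-8 bytes:
# each byte becomes its own character, so one letter turns into two.
misread = utf8_bytes.decode("iso-8859-1")

print(len(utf8_bytes), len(latin1_bytes), len(misread))
```

Decoding never fails here, since every byte value is a legal iso-8859-1
character. That is why the result is a valid document that simply displays
the wrong text.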


> I have the feeling, but I may be wrong that the validator should not 
> validate it :) but even that it's not sure. :)
> 
> in http://www.w3.org/TR/xhtml1/#docconf

See section C.1 of that document, which points out one of the relevant
constraints:  namely, that browsers mangle XML declarations, so it's
good not to use them.  The other half is that text fetched over HTTP
is by definition encoded in "iso-8859-1" unless it has a "charset=...".

Combining those issues produces the guideline I usually give:  XHTML
"should" be in ASCII (7-bit, a strict subset of UTF-8) unless it's given
an explicit "charset=..." attribute by webserver config.  With 8859-1
content and no such attribute, HTTP accesses to the document would
behave, but access through local filesystems fails (since then pure XML
rules apply and it looks like UTF-8 with broken characters).
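
A rough Python sketch of why the 7-bit guideline is safe. The helper name
`is_seven_bit` and both sample documents are assumptions made up for this
illustration.

```python
def is_seven_bit(data: bytes) -> bool:
    """Return True if every byte is plain 7-bit ASCII."""
    return all(b < 0x80 for b in data)

# Hypothetical documents: the first uses a numeric character reference,
# the second stores the accented letter as a raw iso-8859-1 byte (E9).
ascii_doc = b"<p>caf&#233;</p>"
latin1_doc = "<p>caf\u00e9</p>".encode("iso-8859-1")

# ASCII bytes decode identically under either assumption.
same_either_way = ascii_doc.decode("utf-8") == ascii_doc.decode("iso-8859-1")

# Raw 8859-1 bytes read under XML's UTF-8 default are malformed.
try:
    latin1_doc.decode("utf-8")
    breaks_as_utf8 = False
except UnicodeDecodeError:
    breaks_as_utf8 = True

print(is_seven_bit(ascii_doc), same_either_way, breaks_as_utf8)
```

The ASCII version reads the same whether the consumer applies the HTTP
default or the XML default, which is the point of the guideline.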


> An XML declaration is not required in all XML documents; however XHTML 
> document authors are strongly encouraged to use XML declarations in all 
> their documents. Such a declaration is required when the character 
> encoding of the document is other than the default UTF-8 or UTF-16 and 
> no encoding was determined by a higher-level protocol.

I wonder if that was my original text?  Sounds familiar.  Note that
HTTP is the "higher level protocol" in question, and it has a default
encoding of iso-8859-1 for all "text/*" mime types.
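
The defaulting rule described above can be sketched in Python as follows.
The function name `effective_charset` is hypothetical, and the fallback
encodes only the rule as stated in this message; it is an illustration, not
how any particular validator or browser is implemented.

```python
from email.message import Message
from typing import Optional

def effective_charset(content_type: str) -> Optional[str]:
    """Charset a client would assume for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased, or None if absent
    if charset:
        return charset                   # explicit "charset=..." wins
    if msg.get_content_maintype() == "text":
        return "iso-8859-1"              # default for "text/*", per the message
    return None                          # no HTTP-level answer for other types

print(effective_charset("text/html"))
print(effective_charset("text/html; charset=utf-8"))
```

A return value of None corresponds to the case where no encoding was
determined by the higher-level protocol, so the XML rules would decide.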

- Dave
Received on Wednesday, 4 December 2002 15:11:01 UTC