Re: utf-8 validation help

Frank Ellermann <nobody@xyzzy.claranet.de> writes:

> Jukka K. Korpela wrote:
>
>> They might also be using "free" web space on a server that
>> adds some code on each page sent, making it invalid.
>
> Yes, that would be a hopeless case.  But RFC 2616 is more
> tolerant wrt the http header.  If the choice is "no info" vs.
> "wrong info" I pick the former - some of my plain text files
> are pc-multilingual-850+euro, no decent Web server could get
> this right without direct instructions.
>
>> The charset issue is however much less serious
>
> This got a MAY, a SHOULD, and two MUSTs in 3.4.1 of RFC 2616.
> And probably my browser belongs to the "unfortunately" cases.
> Tough.  At least this mess is limited to HTTP/1.0, so that
> can't confuse the validator.

RFC 2616 is trumped by the HTML 4 spec, which states that user agents
must not assume any default value (such as ISO-8859-1) when the
charset parameter is absent from the HTTP Content-Type header.  A user
agent must manage its own default for encoding if none is specified in
the HTTP headers, a META declaration, or a charset attribute.

  http://www.w3.org/TR/html4/charset.html#h-5.2.2
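
To make the order of precedence concrete, here is a rough Python
sketch of how a UA might pick the encoding from those three sources.
The function and parameter names are my own invention for
illustration, not taken from the spec or from any validator, and the
parameter parsing is deliberately simplistic.

```python
import re

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def pick_encoding(http_content_type=None, meta_content=None, link_charset=None):
    """Check the sources in the priority order of HTML 4.01 section 5.2.2.

    Hypothetical inputs: the HTTP Content-Type value, the content of a
    META http-equiv="Content-Type" element, and the charset attribute of
    the element that linked to the resource.
    Returns (encoding, source), or (None, None) if nothing is declared.
    """
    candidates = [
        (charset_from_content_type(http_content_type), "HTTP header"),
        (charset_from_content_type(meta_content), "META declaration"),
        (link_charset.lower() if link_charset else None, "charset attribute"),
    ]
    for encoding, source in candidates:
        if encoding:
            return encoding, source
    # No ISO-8859-1 fallback here: the caller has to apply its own default.
    return None, None
```

With this sketch a bare "text/html" header yields (None, None), which
is the point where the UA's own default has to take over.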

I'd suggest that a validating UA might use US-ASCII as its default
encoding and raise errors for out-of-range characters.  Of course
there should still be a warning if neither the web server nor the
document specifies an encoding.  There must be many pages that render
exactly the same if interpreted with an encoding of US-ASCII,
ISO-8859-1, WINDOWS-1252 or UTF-8 (those whose bytes are all in the
ASCII range).
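
A minimal Python sketch of that suggestion follows.  It assumes the
document arrives as raw bytes with no encoding declared anywhere; the
function names and message wording are hypothetical, and this is not
how any existing validator is known to behave.

```python
def validate_undeclared(data):
    """Treat undeclared bytes as US-ASCII and report what a validator might say."""
    messages = ["warning: no encoding declared; assuming US-ASCII"]
    for offset, byte in enumerate(data):
        if byte > 0x7F:
            messages.append(
                "error: byte 0x%02X at offset %d is outside US-ASCII" % (byte, offset)
            )
    return messages

def renders_identically(data, encodings=("us-ascii", "iso-8859-1",
                                          "windows-1252", "utf-8")):
    """True if the bytes decode to the same text under every listed encoding."""
    decoded = set()
    for name in encodings:
        try:
            decoded.add(data.decode(name))
        except UnicodeDecodeError:
            # Not even valid in this encoding, so it cannot render the same.
            return False
    return len(decoded) == 1
```

A pure-ASCII page passes renders_identically() and draws only the
warning, while a single 0xE9 byte fails the check and draws an error
as well.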

This is for HTML; the rules are different for XHTML.
-- 
Pete Forman                -./\.-  Disclaimer: This post is originated
WesternGeco                  -./\.-   by myself and does not represent
pete.forman@westerngeco.com    -./\.-   the opinion of Schlumberger or
http://petef.port5.com           -./\.-   WesternGeco.

Received on Thursday, 31 August 2006 14:41:14 UTC