Date: Wed, 07 Mar 2001 13:50:05 -0700
From: Thanasis Kinias <tkinias@asu.edu>
To: "'Liam Quinn'" <liam@htmlhelp.com>
Cc: www-validator@w3.org, "'Bertilo Wennergren'" <bertilow@hem.passagen.se>
Message-id: <A021872EC2BDD411AB3600902746A055016047B5@mainex4.asu.edu>
Subject: RE: Character set question
Liam Quinn wrote:
> On Wed, 7 Mar 2001, Thanasis Kinias wrote:
[snip]
> > The default
> > charset is UTF-8, which is identical to ISO Latin-1 (ISO 8859-1).
> There is no default charset for HTML, and UTF-8 is not identical to
> ISO-8859-1. UTF-8 and ISO-8859-1 are only identical for the 7-bit
> (US-ASCII) characters.
From the HTML 4.01 recommendation
(<http://www.w3.org/TR/html4/charset.html#h-5.2.2>):
> The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
> default character encoding when the "charset" parameter is absent from
> the "Content-Type" header field. In practice, this recommendation has
> proved useless because some servers don't allow a "charset" parameter
> to be sent, and others may not be configured to send the parameter.
> Therefore, user agents must not assume any default value for the
> "charset" parameter.
I guess the latter half of this means you really ought to specify the
charset for HTML. I've been working with XHTML so I forgot HTML was
different.
I should have been clearer when I wrote that UTF-8 = ISO Latin-1; I meant
only the lower 128 code points (US-ASCII). Of course, you are correct; they
are not identical above U+007F.
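To make the point about the lower 128 code points concrete, here is a
minimal Python sketch. The helper name encodings_agree and the sample
strings are mine, chosen only for illustration; the codec names are the
standard ones shipped with Python.

```python
def encodings_agree(text: str) -> bool:
    """Return True if text yields the same bytes in UTF-8 and ISO-8859-1.

    Hypothetical helper for illustration only.
    """
    try:
        latin1_bytes = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Characters above U+00FF cannot be represented in ISO-8859-1 at all.
        return False
    return text.encode("utf-8") == latin1_bytes


# Pure US-ASCII text: both encodings produce identical bytes.
print(encodings_agree("charset"))            # True

# "à" is U+00E0: one byte in ISO-8859-1, two bytes in UTF-8.
print(encodings_agree("vis-à-vis"))          # False
print("vis-à-vis".encode("iso-8859-1"))      # b'vis-\xe0-vis'
print("vis-à-vis".encode("utf-8"))           # b'vis-\xc3\xa0-vis'
```

So a page containing only US-ASCII bytes decodes the same way under either
label, while anything above U+007F does not.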
> The charset declaration is required for HTML documents, regardless of
> whether you use entities.
If the server properly sends the charset parameter, the <meta> declaration
of charset is redundant. From HTML 4.01:
> To address server or configuration limitations, HTML documents _may_
> include explicit information about the document's character encoding;
> the META element can be used to provide user agents with this
> information.
[emphasis added]
If one is only using ASCII characters and the server is sending a charset
value in the header Content-Type field (whether it's sending UTF-8, Latin-1,
or Windows 1252), all is OK vis-à-vis the standards - unless I'm really
misunderstanding "may" in the recommendation.
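As a rough illustration of that reading, here is a Python sketch of the
lookup order: use the charset parameter from the HTTP Content-Type header
if the server sent one, and fall back to a <meta http-equiv> declaration
otherwise. The names charset_from_content_type, MetaCharsetFinder, and
effective_charset are hypothetical, and the sketch assumes the markup is
already available as text; a real user agent has to scan raw bytes and
handles many more cases.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attr.get("content"))


def effective_charset(header_value, html_text):
    """Prefer the HTTP header; otherwise use the <meta> declaration, if any."""
    from_header = charset_from_content_type(header_value)
    if from_header:
        return from_header
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset  # None means nothing was declared anywhere


page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=UTF-8"></head></html>')
print(effective_charset("text/html; charset=ISO-8859-1", page))
print(effective_charset("text/html", page))
```

In the first call the <meta> element is redundant because the header
already carries a charset; in the second it is the only declaration
available.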
At any rate, there isn't a compelling reason _not_ to specify with a <meta>.
And, of course, Bertilo is correct about ISO 8859-1 being preferable to a
proprietary standard.
Liam also wrote (in response to Bertilo):
> But it will cause links containing "#" to fail in IE4 for Windows. So
> ISO-8859-1 is still preferred when you don't need characters outside
> ISO-8859-1.
That's _bizarre_, but I guess not altogether surprising. That answers the
question, I guess. Is that also a problem with XHTML docs with implicit
(default) UTF-8 encoding?
On this subject, must one then specify a charset with XHTML docs served as
text/html, even if it is the default UTF-8?
Thanasis Kinias
Information Dissemination Team, Information Technology
Arizona State University
Tempe, Ariz., U.S.A.
Qui nos rodunt confundantur
et cum iustis non scribantur.