RE: Character set question

Liam Quinn wrote:

> On Wed, 7 Mar 2001, Thanasis Kinias wrote:
[snip]
> > The default
> > charset is UTF-8, which is identical to ISO Latin-1 (ISO 8859-1).

> There is no default charset for HTML, and UTF-8 is not identical to
> ISO-8859-1.  UTF-8 and ISO-8859-1 are only identical for the 7-bit
> (US-ASCII) characters.

From the HTML 4.01 recommendation
(<http://www.w3.org/TR/html4/charset.html#h-5.2.2>):

> The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
> default character encoding when the "charset" parameter is absent from
> the "Content-Type" header field. In practice, this recommendation has
> proved useless because some servers don't allow a "charset" parameter to
> be sent, and others may not be configured to send the parameter.
> Therefore, user agents must not assume any default value for the
> "charset" parameter.

I take the last sentence to mean that you really ought to specify the
charset for HTML, since user agents must not assume one.  I've been working
with XHTML, so I forgot that HTML was different.

I should have been clearer when I wrote that UTF-8 = ISO Latin-1; I meant
only the lower 128 code points (US-ASCII).  You are correct, of course: the
two encodings are not identical above U+007F.
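
To make the lower-128 point concrete, here is a minimal Python sketch that
compares the bytes each encoding produces for the same text.  The helper
name same_bytes is hypothetical, chosen only for illustration.

```python
def same_bytes(text: str) -> bool:
    """Return True if text encodes to identical bytes in UTF-8 and ISO-8859-1."""
    try:
        latin1 = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # The text contains a character that ISO-8859-1 cannot represent.
        return False
    return text.encode("utf-8") == latin1


# Pure US-ASCII text: both encodings yield the same bytes.
print(same_bytes("charset"))

# Text containing U+00E0: one byte in ISO-8859-1, two bytes in UTF-8.
print(same_bytes("vis-\u00e0-vis"))
print("\u00e0".encode("iso-8859-1"), "\u00e0".encode("utf-8"))
```

The first call should report that the encodings agree and the second that
they do not, which is why a page limited to US-ASCII displays the same
under either label.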

> The charset declaration is required for HTML documents, regardless of
> whether you use entities.

If the server properly sends the charset parameter, the <meta> declaration
of charset is redundant.  From HTML 4.01:

> To address server or configuration limitations, HTML documents _may_
> include explicit information about the document's character encoding;
> the META element can be used to provide user agents with this
> information.
[emphasis added]

If one is only using ASCII characters and the server is sending a charset
value in the Content-Type header field (whether it's sending UTF-8,
ISO-8859-1, or Windows-1252), all is OK vis-à-vis the standards - unless I'm
really misunderstanding "may" in the recommendation.
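
As a rough way to check what a server is actually sending, the sketch below
pulls the charset parameter out of a Content-Type header value using
Python's standard email.message module.  The function name
charset_from_content_type and the sample header strings are assumptions for
illustration, not anything taken from a real server.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lower-cased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Hypothetical header values, as a server might send them.
for value in ("text/html; charset=ISO-8859-1", "text/html"):
    charset = charset_from_content_type(value)
    if charset is None:
        print(value, "-> no charset sent; the document should declare one")
    else:
        print(value, "-> server declares", charset)
```

When the function returns None, the <meta> declaration is the only place
left for the document to state its encoding.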

At any rate, there isn't a compelling reason _not_ to specify the charset
with a <meta> element.  And, of course, Bertilo is correct about ISO 8859-1
being preferable to a proprietary standard.

Liam also wrote (in response to Bertilo):

> But it will cause links containing "#" to fail in IE4 for Windows.  So
> ISO-8859-1 is still preferred when you don't need characters outside
> ISO-8859-1.

That's _bizarre_, but not altogether surprising.  I guess that answers the
question.  Is that also a problem with XHTML docs with implicit (default)
UTF-8 encoding?

On this subject, must one then specify a charset with XHTML docs served as
text/html, even if it is the default UTF-8?

Thanasis Kinias
Information Dissemination Team, Information Technology
Arizona State University
Tempe, Ariz., U.S.A.

Qui nos rodunt confundantur
et cum iustis non scribantur.

Received on Wednesday, 7 March 2001 16:01:50 UTC