RE: Character set question

From: Thanasis Kinias (tkinias@asu.edu)
Date: Wed, Mar 07 2001

    Date: Wed, 07 Mar 2001 13:50:05 -0700
    From: Thanasis Kinias <tkinias@asu.edu>
    To: "'Liam Quinn'" <liam@htmlhelp.com>
    Cc: www-validator@w3.org, "'Bertilo Wennergren'" <bertilow@hem.passagen.se>
    Message-id: <A021872EC2BDD411AB3600902746A055016047B5@mainex4.asu.edu>
    Subject: RE: Character set question
    
    Liam Quinn wrote:
    
    > On Wed, 7 Mar 2001, Thanasis Kinias wrote:
    [snip]
    > > The default
    > > charset is UTF-8, which is identical to ISO Latin-1 (ISO 8859-1).
    
    > There is no default charset for HTML, and UTF-8 is not identical to
    > ISO-8859-1.  UTF-8 and ISO-8859-1 are only identical for the 7-bit
    > (US-ASCII) characters.
    
    From the HTML 4.01 recommendation
    (<http://www.w3.org/TR/html4/charset.html#h-5.2.2>):
    
    > The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a
    > default character encoding when the "charset" parameter is absent from
    > the "Content-Type" header field. In practice, this recommendation has
    > proved useless because some servers don't allow a "charset" parameter
    > to be sent, and others may not be configured to send the parameter.
    > Therefore, user agents must not assume any default value for the
    > "charset" parameter.
    
    I guess the latter half of this means you really ought to specify the
    charset for HTML.  I've been working with XHTML so I forgot HTML was
    different.
    
    I should have been clearer when I said UTF-8 = ISO Latin-1; I meant only
    the lower 128 code points (U+0000 through U+007F).  Of course, you are
    correct; they are not identical above U+007F.
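
    The point is easy to check.  A minimal Python sketch (the helper name
    same_bytes is made up for illustration) that compares the bytes each
    encoding produces for U+0000 through U+00FF:

```python
def same_bytes(ch: str) -> bool:
    """Return True if ch encodes to identical bytes in UTF-8 and ISO-8859-1."""
    return ch.encode("utf-8") == ch.encode("iso-8859-1")

# U+0000..U+007F: both encodings emit the same single byte.
ascii_identical = all(same_bytes(chr(cp)) for cp in range(0x80))

# U+0080..U+00FF: ISO-8859-1 emits one byte, UTF-8 emits two.
upper_differs = all(not same_bytes(chr(cp)) for cp in range(0x80, 0x100))

# Example: U+00E9 (e with acute accent) in each encoding.
e_acute_latin1 = "\u00e9".encode("iso-8859-1")
e_acute_utf8 = "\u00e9".encode("utf-8")

print(ascii_identical, upper_differs)
print(e_acute_latin1, e_acute_utf8)
```

    ISO-8859-1 defines nothing beyond U+00FF, so the comparison stops there.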
    
    > The charset declaration is required for HTML documents, regardless of
    > whether you use entities.
    
    If the server properly sends the charset parameter, the <meta> declaration
    of charset is redundant.  From HTML 4.01:
    
    > To address server or configuration limitations, HTML documents _may_
    > include explicit information about the document's character encoding;
    > the META element can be used to provide user agents with this
    > information.
    [emphasis added]
    
    If one is only using ASCII characters and the server is sending a charset
    value in the header Content-Type field (whether it's sending UTF-8, Latin-1,
    or Windows 1252), all is OK vis-à-vis the standards - unless I'm really
    misunderstanding "may" in the recommendation.
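
    As a rough sketch of that case in Python: the function names and the
    ASCII_SUPERSETS set below are assumptions made for illustration, not
    anything the recommendation defines.  The standard library's
    email.message.Message does the header parsing.

```python
from email.message import Message
from typing import Optional

# The three charsets named above; each agrees with US-ASCII for bytes
# 0x00-0x7F. (Limiting the check to these three is an assumption.)
ASCII_SUPERSETS = {"utf-8", "iso-8859-1", "windows-1252"}

def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value,
    or None when the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

def header_alone_suffices(content_type: str, body: bytes) -> bool:
    """True when the header names one of the charsets above and the
    body contains only ASCII bytes."""
    charset = charset_from_content_type(content_type)
    return charset in ASCII_SUPERSETS and body.isascii()

print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(header_alone_suffices("text/html; charset=UTF-8", b"<p>hello</p>"))
```

    This only covers the narrow ASCII-only situation; it says nothing about
    documents that use characters above U+007F.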
    
    At any rate, there isn't a compelling reason _not_ to specify the charset
    with a <meta> element.
    And, of course, Bertilo is correct about ISO 8859-1 being preferable to a
    proprietary standard.
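
    For checking whether a page already carries such a declaration, a small
    Python sketch using the standard html.parser module might look like
    this.  The class and function names are hypothetical, and it only looks
    for the http-equiv="Content-Type" form of the <meta> element.

```python
from email.message import Message
from html.parser import HTMLParser
from typing import Optional

class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first
    <meta http-equiv="Content-Type" content="..."> element seen."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        content = attr.get("content")
        if (attr.get("http-equiv") or "").lower() == "content-type" and content:
            msg = Message()
            msg["Content-Type"] = content
            self.charset = msg.get_content_charset()

def find_meta_charset(html_text: str) -> Optional[str]:
    """Return the lowercased charset declared via <meta>, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset

page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=ISO-8859-1"></head><body></body></html>')
print(find_meta_charset(page))
```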
    
    Liam also wrote (in response to Bertilo):
    
    > But it will cause links containing "#" to fail in IE4 for Windows.  So
    > ISO-8859-1 is still preferred when you don't need characters outside
    > ISO-8859-1.
    
    That's _bizarre_, but I guess not altogether surprising.  That answers the
    question I guess.  Is that also a problem with XHTML docs with implicit
    (default) UTF-8 encoding?
    
    On this subject, must one then specify a charset with XHTML docs served as
    text/html, even if it is the default UTF-8?
    
    Thanasis Kinias
    Information Dissemination Team, Information Technology
    Arizona State University
    Tempe, Ariz., U.S.A.
    
    Qui nos rodunt confundantur
    et cum iustis non scribantur.